Performance Improvements in Crawling Modern Web Applications

by

Alireza Zarei

B.Sc., Amirkabir University of Technology (Tehran Polytechnic), 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF APPLIED SCIENCE
in
The Faculty of Graduate and Postdoctoral Studies
(Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

February 2014

© Alireza Zarei 2014

Abstract

Today, a considerable portion of our society relies on Web applications to perform numerous tasks in everyday life, for example transferring money over wire or purchasing flight tickets. To ensure that such pervasive Web applications perform robustly, various tools have been introduced in the software engineering research community and in industry. Web application crawlers are one instance of such tools, used in the testing and analysis of Web applications. Software testing, and in particular testing Web applications, plays an imperative role in ensuring the quality and reliability of software systems. In this thesis, we aim at optimizing the crawling of modern Web applications in terms of memory and time performance.

Modern Web applications are event driven and have dynamic states, in contrast to classic Web applications. Aiming at improving the crawling process of modern Web applications, we focus on state transition management and on the scalability of the crawling process. To improve the time performance of the state transition management mechanism, we propose three alternative techniques, revised incrementally. In addition, aiming at increasing state coverage, i.e. the number of states crawled in a Web application, we propose an alternative solution for the storage and retrieval of dynamic states that reduces memory consumption. Moreover, a memory analysis is performed using memory profiling tools to investigate further areas of memory performance optimization.

The proposed enhancements improve the time performance of the state transition management by 253.34%. That is, the time consumption of the default state transition management is 3.53 times that of the proposed solution, which in turn means the time consumption is reduced by 71.69%. Moreover, the scalability of the crawling process is improved by 88.16%; that is, the proposed solution covers a considerably greater number of states when crawling Web applications. Finally, we identify the bottlenecks of scalability to be addressed in future work.

Preface

This thesis is original, independent work, designed, carried out, and analyzed by the author, Alireza Zarei.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
2 Background
  2.1 Web Applications
    2.1.1 HTML and DOM
    2.1.2 Modern vs. Classic Web Applications
  2.2 Crawling Web Applications
    2.2.1 Crawling Classic Web Applications
    2.2.2 Crawling Modern Web Applications
    2.2.3 Problem and Proposed Solutions
  2.3 Performance Improvements
    2.3.1 Time Performance Improvement
    2.3.2 Memory Performance Improvement
  2.4 Graph Databases
3 Reverse Engineering Crawljax for Optimization
  3.1 Crawljax High Level Algorithm
  3.2 Reverse Engineering Crawljax
    3.2.1 Investigating Crawljax for Alternative Solutions
4 Motivation and Research Goals
  4.1 Motivation
  4.2 Research Questions
    4.2.1 Improving Time Performance
    4.2.2 Increasing State Coverage
  4.3 Goals
    4.3.1 Time Performance
    4.3.2 Memory Performance
  4.4 Methodology
5 Optimizing State Transition Management
  5.1 Tracking State Changes in Web Applications
  5.2 Foundations for the Proposed Techniques
    5.2.1 Tracking DOM Mutations
    5.2.2 Mutation Events
    5.2.3 Mutation Observers
    5.2.4 Mutation-Summary Library
    5.2.5 Proxies
  5.3 Alternative Methods for Tracking State Changes
    5.3.1 Shared Characteristics of the Solutions
    5.3.2 Proxy-Based Solution
    5.3.3 Plugin-Based Solution
    5.3.4 Plugin-Based Solution with In-Memory Agent
  5.4 Evaluation
    5.4.1 Research Questions Broken Down
    5.4.2 Experimental Design and Methodology
    5.4.3 Results
6 Increasing the Number of States Crawled
  6.1 Memory Management
    6.1.1 Java Virtual Machine Heap
    6.1.2 Garbage Collection
  6.2 Criteria for Alternative Solutions
  6.3 Criteria for Selecting a Graph Database
    6.3.1 Required Functionalities
    6.3.2 Licensing
    6.3.3 Partial Locks
    6.3.4 Storing Objects in Nodes and Edges
    6.3.5 API for Java and Maven Availability
  6.4 Comparison of Graph Databases
  6.5 Scalable Solution
    6.5.1 Memory Performance Trade-Offs
    6.5.2 Time Performance Trade-Offs
  6.6 Evaluation
    6.6.1 Research Questions
    6.6.2 Experimental Design and Methodology
    6.6.3 Results
  6.7 Further Memory Analysis
7 Discussion
  7.1 State Transition Management
    7.1.1 RQ1.A: Time Performance of the Proxy-Based Solution
    7.1.2 RQ1.B: Time Performance of the First Plugin-Based Solution
    7.1.3 RQ1.C: Time Performance of the Second Plugin-Based Solution
  7.2 Scalability
    7.2.1 RQ2.0: Determinism of the Crawler
    7.2.2 RQ2.A: Correctness
    7.2.3 RQ2.B: Memory Performance
    7.2.4 RQ2.C: Time Performance
  7.3 Threats to Validity
    7.3.1 Internal Validity
    7.3.2 External Validity
  7.4 Future Work
    7.4.1 Further Scalability through String Optimization
    7.4.2 Further Scalability through Candidate Elements Optimization
    7.4.3 Scalability Improvement by Text Compression Techniques
    7.4.4 Improving Time Performance Targeting
8 Related Work
  8.1 Optimizing the Crawling Process
  8.2 Graph Database Applications
9 Conclusion
Bibliography
Appendix A Deterministic Candidates

List of Tables

4.1 Average DOM size in characters
5.1 State transition management experiments objects
5.2 Proxy-based time consumption (milliseconds); the proxy-based "time overhead" is much greater than the time improvements, i.e. the "saved time".
5.3 Standard deviations of proxy-based time consumption
5.4 Relative standard deviation percentages (absolute values of the coefficient of variation) of proxy-based time consumption
5.5 Maximum values of proxy-based time consumption (milliseconds)
5.6 Minimum values of proxy-based time consumption (milliseconds)
5.7 Proxy-based bypassed comparisons: detailed information on events fired, comparisons categorized as unnecessary, false negatives and false positives.
5.8 External-js plugin-based solution times (milliseconds): the time improvement of the first plugin-based solution is much greater than the time overhead it imposes on the crawling process.
5.9 Standard deviations of external-js plugin-based time consumption
5.10 Relative standard deviation percentages (absolute values of the coefficient of variation) of external-js plugin-based time consumption
5.11 Maximum values of external-js plugin-based time consumption (milliseconds)
5.12 Minimum values of external-js plugin-based time consumption (milliseconds)
5.13 External-js plugin-based bypassed comparisons: detailed information on events fired, comparisons categorized as unnecessary, false negatives and false positives.
5.14 Internal-js plugin-based solution times (milliseconds): the second plugin-based solution overtakes the default process; the time it saves is greater than the time overhead it imposes on the crawling process.
5.15 Standard deviations of internal-js plugin-based time consumption
5.16 Relative standard deviation percentages (absolute values of the coefficient of variation) of internal-js plugin-based time consumption
5.17 Maximum values of internal-js plugin-based time consumption (milliseconds)
5.18 Minimum values of internal-js plugin-based time consumption (milliseconds)
5.19 Internal-js plugin-based bypassed comparisons: detailed information on events fired, comparisons categorized as unnecessary, false negatives and false positives.
6.1 Graph databases comparison: Neo4j emerged as the graph database most suited to our application.
6.2 Graph databases comparison scale levels
6.3 Correctness experiments objects
6.4 Scalability experiments objects
6.5 Scalability experiments results: the proposed version outperforms the default crawler in terms of the number of states crawled.
6.6 Time performance experiments results: the default crawler performs more efficiently in most cases; however, in two cases the proposed solution overtakes the default crawler, and in six out of ten cases the time overhead is less than 10%.
6.7 Initial memory experiment specifications
6.8 Second memory experiment specifications

List of Figures

2.1 Ajax vs. classic Web application models. Source: [1].
3.1 Crawljax high-level algorithm
3.2 Crawljax high-level algorithm with more details
5.1 Generic solution components
5.2 Proxy-based solution components
5.3 Plugin-based solution components: external injection
5.4 Plugin-based solution components: internal injection
5.5 Proxy-based solution time performance: the time overhead imposed by the proxy-based solution is much greater than the time it saves by bypassing unnecessary comparisons; hence, the default crawling performs more efficiently. The bars represent the averages taken over 5 rounds of experiments presented in Table 5.2.
5.6 External-js plugin-based solution times: the first plugin-based solution, which employs a Web server and injects scripts externally, overtakes the default process. The time improvement on average is much greater than the overhead imposed. The bars represent the averages taken over 5 rounds of experiments presented in Table 5.8.
5.7 Internal-js plugin-based solution times: the time the second plugin-based solution saves is greater than the time overhead it imposes on the process; hence, the second plugin-based solution also overtakes the default state transition management system. The bars represent the averages taken over 5 rounds of experiments presented in Table 5.14.
6.1 Java heap divided into generations, and the runtime options. Source: [2].
6.2 Scalable vs. default number of states crawled: in all cases the proposed version overtakes the default crawler in terms of the number of states crawled given a limited amount of heap memory.
6.3 Scalability improvements: the percentage of increase in the number of states crawled.
6.4 Scalable vs. default time performance: the proposed version overtakes the default crawler in only two cases out of ten, and in six cases the time overhead is less than 10%. However, on average the time overhead is 18.20%.
6.5 Memory consumption of the default version: the experiment is carried out on Google.com and the memory consumption is demonstrated.
6.6 Memory consumption of the proposed version: the experiment is carried out on Google.com and the memory consumption is demonstrated.
6.7 Memory consumption of the default version: the experiment is carried out on Yahoo.com and the memory consumption is demonstrated.
6.8 Memory consumption of the proposed version: the experiment is carried out on Yahoo.com and the memory consumption is demonstrated.
6.9 Default memory consumption: the lowest curve (in red) shows old generation allocation; on top of that, the young generation (in salmon) is presented. The highest curve (in blue) shows the maximum amount of memory available to be allocated, i.e. the Java heap size. The important observation here is that the old generation allocation never drops, as all the states are retained in memory. In this experiment the Java maximum heap size is set to 1 gigabyte and 330 states are crawled.
6.10 Improved memory consumption: the lowest curve (in red) shows old generation allocation; on top of that, the young generation is presented (in salmon). The highest curve (in blue) shows the maximum Java heap memory available to be allocated. One of the results of our work is that the old generation allocation occasionally drops, i.e. state data is freed. This enables the crawler to discover 542 states using 1 gigabyte of memory.

Acknowledgements

I wish to thank my supervisor, Dr. Ali Mesbah, for his careful supervision, guidance and support in carrying out the projects constituting this thesis. I would also like to express my gratitude and appreciation for the internship opportunities he brought about for me during my M.A.Sc. program.
Without doubt, this work would not have been possible without his generous support, help and close attention.

I would like to thank Peter Luong, president and founder of FusionPipe Software Solutions, for creating internship opportunities which gave me experience in preparing for industry and supported me in completing my program. I would also like to thank him for his kind supervision and his generous attitude in sharing his invaluable expertise with me during the projects. Special thanks to Sang Mah, whose enthusiasm, extraordinarily positive attitude, and tireless efforts at MITACS brought about numerous fruitful industry experiences for students such as me at UBC.

Moreover, I would like to thank my committee members, Dr. Karthik Pattabiraman and Dr. Sathish Gopalakrishnan, for kindly agreeing to participate in the committee evaluating my thesis. I would like to thank them for their utmost generosity with their time, their great improvement ideas, and their constructive feedback for revising my thesis.

I would also like to thank Dr. Pooyan Fazli for providing me with his effective mentorship and valuable time during the course of the program.

Last but not least, I would like to thank our friendly community in the Software Analysis and Testing (SALT) lab, who helped me greatly during the course of this research with their encouragement, support and feedback.

Dedication

I would like to dedicate this work to my family: my parents, whose love and support have always filled my heart with strength and hope. I wish to dedicate this work to them for giving me the opportunity to follow my heart and pursue my dreams. I would also like to dedicate this thesis to my dearest sisters, Mina and Maryam, who are the meaning of hope for me.

I would also like to dedicate this thesis to my friends, whose encouragement, support and understanding have always empowered me and kept me going during the most difficult times.

Chapter 1: Introduction

Nowadays, Web applications play an important role in performing various day-to-day tasks in businesses, industry, and individuals' daily lives. Performing tasks such as on-line banking or booking flight tickets has become possible through the prevalence of Web applications. The pervasiveness of Web applications in carrying out tasks in modern societies calls for ensured reliability and enhanced quality. To that aim, various tools have been introduced to help Web developers and quality assurance professionals deliver more robust Web applications. Web application crawlers are among the most important tools in the area of Web application development and search engines. Web crawlers are used in testing Web applications as well as in realizing the indexing of the World Wide Web (WWW) by search engines.

Traditionally, Web applications were composed of a number of self-reliant documents identified by unique Uniform Resource Locators (URLs) [3]. These documents were linked to each other by means of inter-document references called hypertext links or hyperlinks. As such, in order to navigate through a Web application in a Web browser, each time a hyperlink was clicked a new document was retrieved and loaded in the browser window, replacing the previous content of the browser window. This design of Web applications resulted in inefficient time and bandwidth consumption, because even for a minor change needed in the document loaded in the browser, the whole document had to be transmitted and replaced.

As a significant step towards the maturity of Web applications, in addition to the hyperlink-based URL transition mechanism, "modern" Web applications utilize "Ajax" technologies to provide "interactivity" and "responsiveness" [4] for users. Ajax technologies, including scripting languages such as JavaScript [5] and asynchronous method calls, help achieve dynamic Web applications. Dynamic modern Web applications boost interactivity by minimizing the time users need to wait for the application to become responsive again after each interaction, such as clicking on a photo. The advent of modern Web applications greatly improved the user experience, but on the other hand Web crawlers have to pay a price, because crawling event-driven modern Web applications is inherently more resource demanding.

Web application crawlers automate the crawling of Web applications by exploring hyperlink-based transitions as well as other types of transitions such as Ajax-based ones. These multi-layer tools are used for various purposes. They can be used for indexing Web applications in search engines. They are also used as a means for testing Web applications. Crawljax [4], the modern Web crawler discussed in this thesis, is able to explore JavaScript-enabled Web applications.

Crawling modern Web applications is inherently a resource-demanding (e.g. time, compute, memory) task for various reasons. First, as opposed to crawling classic static HTML Web pages, modern crawlers need to execute JavaScript and apply the dynamic changes to the state of the Web application being crawled, whereas in crawling traditional HTML-based Web applications a transition from one state to another is more of a retrieval and parsing process than an execution one. The event-driven model of modern Web applications, the execution of scripts, and the navigation of dynamic transitions are more time consuming than merely retrieving pages and following static URLs. The reason is that a modern Web crawler needs to wait for a browser (or another tool that can execute JavaScript) to load the content and execute the JavaScript.

In addition, modern Web application crawlers are composed of various layers and technologies working together to make the crawling process feasible. Especially the additional components for executing JavaScript and managing state navigation impose considerable resource requirements on the crawling process. For example, in crawling modern Web applications the crawler needs to communicate with the execution layer (e.g. a browser). These inter-process communications add sizable time consumption to the crawling process.

In this work, we aim at improving the memory and time performance of the crawling process for modern Web applications. We investigate two components of the crawling process to improve its time and memory efficiency. First, we focus on the changes that occur in Web applications and induce state transitions; by doing so, we aim to improve the time efficiency of the state transition mechanism of the crawling process. In addition, we investigate the memory utilization of the crawling process to improve the state coverage of the crawling; in other words, we aim at increasing the number of states crawled given a limited amount of resources (memory).

In the optimization of the state transition management, we propose three different solutions, improved incrementally over one another, to reduce the time consumption of tracking changes in the crawling process. We employ different methods and techniques to achieve this improvement in time performance.

In order to increase the number of states crawled given a limited amount of memory, we introduce a new component to the crawler that frees the memory required for saving the states of the Web application during the crawling process. To this aim, we replace the current storage and retrieval component, which relies on main memory, with our proposed solution to crawl a greater number of states.

The first alternative solution we propose, utilizing a proxy, did not overtake the default state transition management. However, the second enhancement, without a proxy, was able to improve the time performance of the state transition management by 253.34%; that is, the time consumption of the default state transition management is 3.53 times that of the proposed solution. The third alternative solution also overtook the default mechanism, by 197.48%. The second solution demonstrated the best performance among all available mechanisms.

Moreover, the scalability of the crawling process is improved by 88.16%; that is, the proposed solution covers a considerably greater number of states in crawling Web applications. This scalability improvement was achieved at the cost of an 18.20% time overhead. Finally, we identified the bottlenecks of scalability to be addressed in future work.

The rest of this work is organized as follows. In Chapter 2 we present the background information related to the concepts and technologies used in this work. In Chapter 3 we provide an overview of Crawljax, the crawler on which our ideas are examined; this overview is based on the reverse engineering effort we undertook to shed light on the rapidly changing characteristics of the crawler. Chapter 4 covers the motivation and the research goals. What follow afterward are the two main chapters covering the improvements we targeted in the crawling process: the time performance of the state transition management in Chapter 5, and increasing the number of states crawled in Chapter 6. Afterward, we revisit the research questions and address threats to validity and future work in the discussion in Chapter 7. Finally, a review of the related work concludes the thesis.

Chapter 2: Background

In this chapter we briefly touch upon the concepts and information that constitute the foundations on which this thesis is based. We start by shedding some light on Web application concepts, because Web applications are discussed as the objects of the experiments throughout the thesis. Web applications are directly used in designing the experiments, and they are also pivotal in explaining many parts of the thesis, such as the crawling process in Web crawlers and the memory analysis we performed. In addition, we briefly explain the crawling of both traditional and modern Web applications. Afterward, we proceed to pinpoint the aspects of the crawling process that we aim to improve.

2.1 Web Applications

A Web application is any piece of software that is built to be run in a common Web browser such as Mozilla Firefox®. In other words, a Web application is a special type of client-server application that uses a Web browser as its client.
Web applications usually consist of two components (sides): a client side and a server side. In this thesis we focus only on the client side. The area of Web application development is richly diverse, and there are manifold technologies, languages and standards used in developing Web applications. We give a short overview of some related technologies in the following.

Web applications constitute a considerable portion of deployed software in today's computing and business ecosystems. Web-mail services, news broadcasting Web sites and on-line banking portals are examples of Web applications used on a daily basis. Web applications vary greatly in complexity and size, from only a few lines of static HTML documents to hundreds of thousands of lines of code in a dynamic multi-tier application.

The technologies, languages, approaches and techniques used in creating Web applications are heterogeneously diverse. To name a few, HyperText Markup Language (HTML), eXtensible Markup Language (XML) and Cascading Style Sheets (CSS) are used for creating the content and defining the presentation of the content in the client-side user interface of Web applications. Scripting languages such as JavaScript [5] (ECMAScript [6]) empower the client side of Web applications with dynamic execution. They also provide control over content modifications and help reduce the amount of data transferred between the browser and the server by enabling dynamic updates to the user interface. On the server side, PHP, ASP.NET, and Java technologies are popular for building the engines of Web applications. The languages and technologies used in Web development are too numerous to be covered here, so in the rest of this chapter we only explain the concepts that are directly pertinent to our work.

2.1.1 HTML and DOM

The information openly published on the Internet is most effective when it can serve a vast audience. In order to maximize the reach of the content published on the Web, in the early days of the Internet a universal language that all computers could agree on was a desideratum. To that aim, HTML, developed by Tim Berners-Lee and popularized in the 1990s, became the "mother tongue" of the World Wide Web (Web). HTML is used in developing Web applications to create content, retrieve documents through hypertext links, add forms for accessing services, and include multimedia in a page [7]. HTML's role in Web applications is interrelated with its live, in-memory representation, which is called the HTML DOM.

As set out by the World Wide Web Consortium [8] (W3C) in the specification of the Document Object Model (DOM), the DOM is a language- and platform-independent convention that aims at unifying the usage of documents in computer programs. Using the DOM interface, programs and scripts achieve a unified way of accessing and modifying documents. Numerous types of documents, including XML and XHTML documents, which are incremental improvements over HTML, are used in the area of Web applications.

When an HTML document is retrieved from a Web server, the browser parses the page and builds an object representing the page based on the DOM specification. This object is called the DOM object or DOM tree. The scripts of the Web application (e.g. JavaScript) can access and modify the DOM tree. These accesses and manipulations can retrieve and modify the presentation, content and structure of the DOM tree. The DOM is also used by the layout engines of browsers: layout engines such as WebKit [9] use the DOM interface to render the presentation of Web pages.
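Since the crawler discussed in this thesis constantly extracts, parses and compares DOM trees, it is worth seeing how a DOM tree can be built and queried programmatically outside the browser. The following is a minimal sketch in Java using the jsoup HTML parser; jsoup and the example markup are illustrative assumptions for this sketch, not part of Crawljax's actual implementation.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomExample {
    public static void main(String[] args) {
        // Parse an HTML string into an in-memory DOM-like tree.
        String html = "<html><body><div id='news'><a href='/story1'>Story 1</a></div></body></html>";
        Document dom = Jsoup.parse(html);

        // Query the tree: select the first anchor element and read its text.
        Element link = dom.select("a").first();
        System.out.println("link text: " + link.text());

        // Modify the tree, much as client-side scripts modify the browser's DOM.
        dom.getElementById("news").append("<span>updated</span>");
        System.out.println(dom.body().html());
    }
}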
2.1.2 Modern vs. Classic Web Applications

The event-driven model of modern Web applications utilizes "Ajax" technologies to reduce the responsiveness gap between classic Web applications and desktop applications. Ajax is a collective short form for the combination of Asynchronous JavaScript and XML. Modern Web applications use different Ajax technologies: for example, XML and HTTP requests are used for communicating data, the HTML DOM for the live presentation of user interfaces, and CSS for a presentation layer over the content. In addition, JavaScript is used for DOM manipulation and for tying all the aforementioned technologies together [1].

As shown in Figure 2.1 from [1], the model of modern Web applications differs from that of classic Web applications in terms of the data exchange method and user interaction. In classic Web applications, the application is a set of separate "whole" HTML pages with logical links. These pages are connected together by hypertext links. In classic Web applications, the browser issues an HTTP request to the server of the application once a user clicks on a link or submits a form. The server then analyzes the request and creates the response as a complete new HTML page [1].

Figure 2.1: Ajax vs. classic Web application models. Source: [1].

On the other hand, in modern Web applications an Ajax engine enhances the user experience by eliminating the need to reload a whole new page each time data is exchanged with the server. That is, users' actions on a modern Web application result in asynchronous requests being sent to the server of the application. In the meantime, the application remains responsive and does not wait on hold while the server processes the request and sends the data back. The key difference here is that instead of reloading the whole application as a new HTML page, the Ajax engine processes the response data and applies partial changes to the DOM of the application. Hence, the data exchange is minimal and the user does not need to wait for the response and a complete reload of the application [1].

As modern Web applications are richer and have a greater number of components than classic Web applications, the crawling method for each category is different. In the following we give an overview of the crawling of each of these two types of Web applications and illustrate how they differ.

2.2 Crawling Web Applications

Crawling Web applications is the process of exploring the structure and the content of Web applications. In classic Web applications, exploring the application is carried out by following hypertext links. In modern Web applications, on the other hand, elements such as div and span can also be sources of transition from one page of the application to another. Hence, the crawling process is to some extent more resource intensive for modern Web applications. We cover the differences in crawling the two classes of Web applications in what follows.

2.2.1 Crawling Classic Web Applications

Hypertext links are pivotal in crawling classic Web applications. A classic Web crawler, comparable to the way general-purpose search engines such as Google® crawl, starts from a provided URL, retrieves the page, parses the page, extracts all absolute and relative URLs, and adds them to the queue designated for URLs to be crawled next. This process is repeated recursively until the queue is emptied or certain conditions are met.
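To make this hyperlink-driven process concrete, the sketch below shows a minimal breadth-first classic crawler in Java. It again assumes the jsoup library for fetching and parsing pages; the seed URL and the limit of 100 pages are illustrative, and the sketch deliberately omits robustness concerns such as politeness delays and error handling.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ClassicCrawler {
    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("http://example.com/");               // seed URL

        while (!frontier.isEmpty() && visited.size() < 100) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;                                   // already crawled
            }
            Document page = Jsoup.connect(url).get();       // retrieve and parse the page
            for (Element link : page.select("a[href]")) {
                String target = link.attr("abs:href");      // resolve relative URLs
                if (!target.isEmpty() && !visited.contains(target)) {
                    frontier.add(target);                    // schedule for a later iteration
                }
            }
        }
    }
}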
2.2.2 Crawling Modern Web Applications

In modern Web applications, on the other hand, dynamic state changes must be taken into account. In crawling modern Web applications, the presence of an Ajax engine in the browser is pivotal. This makes crawling modern Web applications different in a number of ways. First, there are elements other than hypertext links (e.g. div) that can transition the state of the application; these transitions are followed to discover new states and information. In addition, the dynamic nature of JavaScript requires the crawling process to wait for the execution of JavaScript and to allow the DOM tree to settle after firing events.

Because JavaScript-based reactions (function invocations) can be attached to almost any element in a Web page, modern Web application crawlers, in addition to parsing each page and extracting its hypertext links, need to examine other types of elements when exploring Web applications. Crawljax [4], a modern Web application crawler, for example, can be configured to examine a chosen set of HTML elements for exploring Web applications.

As a result of the differences mentioned above, the crawling process of modern Web applications is more involved than that of classic applications. The crawling algorithm for a modern Web crawler such as Crawljax starts with opening a URL and extracting all hypertext links and elements such as div, button, and span, saving them in a candidates queue. Afterward these elements are polled one by one and an event (such as click or hover) is fired on each of them. As non-href (non-hyperlink) elements do not guarantee a change in the URL, a sufficient amount of time is required to allow the dynamic changes and server communications to finish and to let the DOM settle. Then the crawler checks whether more information has been discovered. If so, the newly discovered parts of the application are parsed for candidate elements, and any elements found are added to the queue. This process continues recursively.
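The sketch below illustrates the event-firing step described above using Selenium WebDriver, the browser automation layer that Crawljax builds on. It clicks one candidate element, waits briefly for scripts and asynchronous calls to settle, and reads the DOM back for comparison. The fixed 500 ms wait, the choice of candidate element, and the naive string comparison are illustrative simplifications, not Crawljax's actual behavior.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class FireEventExample {
    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://example.com/");

        String domBefore = driver.getPageSource();        // DOM snapshot before the event

        // Pick one candidate element (here simply the first div) and fire a click on it.
        WebElement candidate = driver.findElement(By.tagName("div"));
        candidate.click();

        // Give script execution and asynchronous server calls time to settle the DOM.
        Thread.sleep(500);

        String domAfter = driver.getPageSource();          // DOM snapshot after the event
        boolean stateChanged = !domAfter.equals(domBefore);
        System.out.println("state changed: " + stateChanged);

        driver.quit();
    }
}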
2.2.3 Problem and Proposed Solutions

As explained earlier, compared to crawling classic Web applications, crawling modern Web applications requires carrying out a greater number of steps. In particular, requirements such as examining a much wider set of HTML elements (e.g. span, href, and div), executing JavaScript code, and communicating with the server and the browser at each state make the crawling of modern Web applications more resource consuming.

Crawling modern Web applications is resource intensive. Resources such as time and memory are consumed to a considerable degree by modern crawlers. The reason is that the crawling process needs to employ additional components for performing the sub-tasks of modern crawling that do not exist in classic crawling.

There are a number of steps which are pertinent only to modern crawling. First, handling the execution of JavaScript has to be done in one way or another, such as by employing a browser. In addition, tracking dynamic state changes must be handled, as any interaction can lead to a new state in the application. Furthermore, managing the interactive communications between the client and the server in each dynamic state is essentially more time consuming than merely retrieving the data.

In order to mitigate the aforementioned resource-demanding qualities of crawling modern Web applications, in this thesis we aim at identifying and tackling some of the performance problems of the crawling process. We specifically focus on improving the time and memory performance of the crawling process. Regarding time performance, we focus on the state transition management of the crawling process; regarding memory performance, we investigate the state storage and retrieval aspects of the crawling process. The discussion continues by expanding upon the performance improvements that are addressed later in this thesis.

2.3 Performance Improvements

Improving the performance of a system is defined as tuning the way resources are consumed by the system in achieving its goals. As mentioned earlier, crawling modern Web applications imposes extra resource requirements on the underlying system the crawler is running on. Time, memory, and compute resources are examples of system resources that are consumed more intensively by a modern Web application crawler compared to classic crawlers. Hence, we aim at improving the time and memory performance of the crawling process by tuning the way specific aspects of the crawling are carried out.

2.3.1 Time Performance Improvement

The process of crawling modern Web applications is more time consuming than the classic one because modern crawling employs more layers of tools working together to achieve the task of crawling. The time needed for performing inter-process communications, performing rounds of requests and responses between the client and the server, and waiting for the user interface to settle after interactions are among the time-consuming tasks performed by the crawling process. In addition, the way specific sub-tasks such as managing the state transitions are performed in the crawling process creates optimization opportunities. In this work, we focus on the state transition management of the crawling process to reduce its time consumption.

2.3.2 Memory Performance Improvement

Memory is one of the system resources that are extensively used in the crawling process. Loading the layers of different tools and components that are necessary for performing the crawling is an important inducer of memory consumption. In addition, parsing HTML pages to extract HTML elements requires significant memory allocation. Another critical part of the crawling process is handling the storage, comparison, retrieval and identification of the states. In this work, we focus on improving the memory performance of this last part of the crawling process in order to improve the scalability of the crawling.

We utilize a graph database to overcome the obstacles hindering the scalability of the crawling process. Graph databases are introduced briefly in the next section to facilitate the presentation of their application to the crawler.
2.4 Graph Databases

Graph databases are a sub-category of NoSQL databases. NoSQL databases, also called "Not only SQL" databases, are a category of databases that provide ways other than SQL for storing and retrieving data. A graph database utilizes a graph data structure for storing and accessing data. The key point is that each item in a graph database provides direct, "look-up free" references to its adjacent elements; in other words, a graph database provides index-free adjacency.

Graph databases are optimized for use in software systems that operate on data that is naturally representable by graphs. The reason is that graph databases make use of graph structures, graph nodes, and graph edges for designing the data model and storing the data. Aside from the ease of modeling graph-like data, graph databases are geared to provide efficient mechanisms for performing graph-specific operations such as finding the shortest path between two nodes in a graph.

Graph databases vary in terms of capabilities, features and the scope for which they are designed. For example, some graph databases provide transaction handling and concurrent access while others do not. There are general-purpose graph databases and more specialized ones such as triple stores. Graph databases are widely used in industry, especially by social networking Web applications. For example, FlockDB [10] is used to support the storage and retrieval needs of the system underlying Twitter® [11].
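To give a flavor of the node-and-edge model, and because Neo4j is the graph database eventually selected in Chapter 6, the sketch below stores two dynamic states and one transition using Neo4j's embedded Java API. The API shown is typical of the Neo4j 2.x line that was current at the time of this work; the database path, property names and relationship type are illustrative assumptions rather than the actual schema used in our solution.

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class GraphDbExample {
    // Edge label for state transitions; the name is an illustrative assumption.
    private enum RelTypes implements RelationshipType { TRANSITION }

    public static void main(String[] args) {
        GraphDatabaseService graphDb =
                new GraphDatabaseFactory().newEmbeddedDatabase("target/stateflow-db");

        try (Transaction tx = graphDb.beginTx()) {
            // Each node stores the serialized DOM of one dynamic state.
            Node index = graphDb.createNode();
            index.setProperty("dom", "<html>...index state...</html>");

            Node next = graphDb.createNode();
            next.setProperty("dom", "<html>...state 1...</html>");

            // The edge records which event led from one state to the other.
            Relationship edge = index.createRelationshipTo(next, RelTypes.TRANSITION);
            edge.setProperty("event", "click");
            edge.setProperty("element", "//DIV[@id='news']");

            tx.success();  // mark the transaction as successful before it closes
        }
        graphDb.shutdown();
    }
}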
We implement the aforementioned improvements as enhancements to Crawljax and perform empirical experiments to evaluate them. Hence, in the next chapter Crawljax is discussed in detail to explain its default algorithm and the enhancements we propose for the crawling process.

Chapter 3: Reverse Engineering Crawljax for Optimization

Crawljax [4] is a software tool for crawling modern Web applications. It is specifically delivered as a solution to the challenges of crawling Ajax-based Web applications. The challenges arise from several characteristics of modern Web applications, including the execution of scripts on the client side, asynchronous communication between the client and the server, and dynamic manipulation of DOM trees. These characteristics result in dynamic state changes in Ajax-based Web applications without a necessary change in the URL of the Web application. Crawljax captures these state changes and the transitions between states in a finite-state machine model. The structure underlying this finite-state machine is called the stateflow graph by its designers, and hereafter we refer to it as such. The stateflow graph, central to the memory enhancements suggested in this thesis, is a directed multigraph (a non-simple directed graph in which loops and multiple edges between vertices are permitted) with states as nodes and transitions as edges.

Crawljax utilizes the Selenium WebDriver [12] to bring up browsers and crawl Web applications. By running Web applications in a real browser, Crawljax ensures that the model it creates matches the real behavior of the Web application under crawl. The state machine created incrementally by Crawljax comprises the state space of the Web application under crawl, along with the navigational paths between states. This state machine can be used for various purposes such as quality assurance and program comprehension.

In this thesis we propose enhancements to Crawljax in order to improve the state coverage and time performance of the crawling process. We explain the internal structure of Crawljax and the parts of its crawling algorithm that are directly related to this work in order to present the work more clearly.

3.1 Crawljax High Level Algorithm

From a high-level perspective, Crawljax works as follows. A crawl session starts by providing the URL of the Web application and additional optional configurations as inputs to the crawler. A Web browser is brought up and driven to open the first page of the Web site referenced by the URL. Afterward, the loaded page is parsed and all of the candidate elements in the page, such as buttons and URLs, are identified to be clicked on in the next steps. Clicking on elements can lead to transitions to new states. Hence, after clicking on each candidate element, the current state of the DOM is compared to the previous state of the DOM. If the state has changed and it is not a replica, it is added to the inferred state machine.

input : URL, Configurations
output: stateFlowGraph, Logs
1  Open the URL in the browser;
2  Parse the loaded DOM and extract all the candidate elements;
3  Save the candidate elements in candidateElements;
4  while termination conditions not met do
5      Pick an element from candidateElements;
6      Fire a click event on the element;
7      Extract the state of the Web application from the browser;
8      Compare the new state with the last captured state;
9      if the new state is different from the previous state then
10         Compare the new state to all previous states;
11         if no replica of the state is found in the stateFlowGraph then
12             Add the new state to the stateFlowGraph;
13             Extract the candidate elements from the new state and store them in candidateElements;
14         Add the transition to the stateFlowGraph;

Figure 3.1: Crawljax high-level algorithm

3.2 Reverse Engineering Crawljax

As Crawljax is evolving rapidly, relying on the initial papers does not suffice for a deep understanding of the current state of the system. We therefore conducted a thorough source code investigation to be able to propose alternative solutions for optimizing the crawling process.

In this thesis, we investigate whether we can utilize a graph database for the storage and retrieval of data in Crawljax. As main memory is usually smaller than secondary storage in computer systems, our aim is to optimize memory consumption by storing the data in a graph database instead of main memory. Based on our previous experience with crawling Web applications with Crawljax, we knew that the states of the Web applications account for a sizable portion of memory consumption. Hence we came up with the idea of storing the stateflow graph of Crawljax in a graph database. There were numerous graph database options. In order to identify the right choice of database for our application, we analyzed the Crawljax implementation to see what criteria and requirements emerge to guide us in selecting the most suitable graph database.

3.2.1 Investigating Crawljax for Alternative Solutions

We investigated the source code of Crawljax to extract the criteria for selecting the graph database most suited to our application. Our findings from this investigation of Crawljax's internal components are as follows.

Crawljax opens a Web browser and goes to the URL of the application. After loading the index page, it examines the DOM tree of the page and extracts all the elements which are likely to change the state in case an event is fired on them. Then Crawljax fires events on these candidate elements and analyzes the changes. Finally, based on the analysis of the changes made to the DOM, Crawljax incrementally builds a stateflow graph which models the states deduced by crawling the application. To store this model, which represents the states and the transitions between them, a directed multigraph is utilized.
This model is defined as follows. The stateflow graph inferred from crawling a Web application A is denoted as a four-tuple (i, V, E, L). The meanings of these symbols are as follows [4]:

- i: the initial state of the Web application. This state is captured when the page has finished loading into the browser and the onload event is fired by the browser.
- V: the set of all states in the application. Each dynamic DOM state is represented by a vertex v which is a member of the set V.
- E: the set of all edges in the stateflow graph. The members of E are ordered pairs (v1, v2) which represent an edge from the vertex v1 to the vertex v2. Such a directed edge exists in the graph when state v2 is reachable from state v1 by firing some event on a clickable (a Web element that can be clicked on) c in state v1.
- L: a labelling function from E to the set of event types and DOM element properties.

This provides us with a solid mathematical understanding of the data structure that our alternative solutions need to provide.
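A minimal way to picture this four-tuple in code is sketched below. The class names echo Crawljax's terminology, but the implementation is an illustrative simplification written for this discussion, not Crawljax's actual source (which builds on a graph library and stores far more per state and per edge).

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// An element of V: one dynamic DOM state.
class StateVertex {
    final String dom;
    StateVertex(String dom) { this.dom = dom; }
}

// An element of E together with its label L(e): the event type and the clickable it was fired on.
class Eventable {
    final StateVertex from;
    final StateVertex to;
    final String eventType;      // e.g. "click"
    final String elementXPath;   // identifies the clickable element

    Eventable(StateVertex from, StateVertex to, String eventType, String elementXPath) {
        this.from = from;
        this.to = to;
        this.eventType = eventType;
        this.elementXPath = elementXPath;
    }
}

// The stateflow graph (i, V, E, L) as a directed multigraph.
class StateFlowGraph {
    final StateVertex index;                                  // i: the initial state
    final Set<StateVertex> states = new LinkedHashSet<>();    // V
    final List<Eventable> edges = new ArrayList<>();          // E; labels are carried on each edge

    StateFlowGraph(StateVertex index) {
        this.index = index;
        states.add(index);
    }

    void addState(StateVertex v) { states.add(v); }

    // Multiple edges between the same pair of states are permitted, as in a multigraph.
    void addTransition(Eventable e) { edges.add(e); }
}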
There are also other important points which must be taken into consideration. If a change is detected in the state of the Web application, the newly detected state is compared to all states in the state machine. If a duplicate state is found, the new state is not added to the stateflow graph. However, regardless of the newness of the found state, a new edge is added to the stateflow graph, from the previous state to the current state. Edges hold additional information such as the type of the event that was fired and the element which the event was fired on. In addition, the current state of the state machine is updated to the newly reached state. This means the crawling continues by exploring the new state, hence a depth-first traversal.

Crawljax has an option to crawl an application with multiple browsers. In concurrent crawling sessions, all crawling nodes share the same stateflow graph, but each crawler node has its own state machine. The state machine is an abstract interface on top of the stateflow graph; hence the stateflow graph is the more concrete data structure in Crawljax. The state machines, in combination with the stateflow graph, ensure synchronized reading and updating of the states and navigational paths.

In other words, the state machine is divided into a global stateflow graph shared by all crawling nodes and multiple state machine instances, one for each crawling node. The crawling nodes are synchronized over the clickable element they are examining. This measure is taken to prevent multiple crawling nodes from exploring the same clickable.

Among the numerous classes and data structures in Crawljax, there are two data structures that are of critical importance when considering the consequences of concurrent modifications. In particular, these data structures encapsulate the logic of the steps that we need to implement for safe concurrent storing and retrieving of data into and from the graph database.

The first data structure stores all the states with candidate elements and is called statesWithCandidates. The second data structure, called cache, is a map from states to queues of actual candidate elements. Each queue stores a series of candidate elements which are extracted from the Web page; these elements are extracted when the browser has loaded the specific state mapped to the queue. Crawljax begins crawling by loading the index state and then running a crawling node called CrawlTaskConsumer, with which the first state, the index state, is crawled. The candidate elements of the first state are put into the cache as a starting point for the rest of the crawling.

A more detailed, but still very high-level, description of the multi-node crawling process of Crawljax is illustrated in Algorithm 3.2.

input : URL, Configurations
output: stateFlowGraph, Logs
1  Open the URL in the browser and load the index page;
2  Parse the index and extract all the candidate elements;
3  currentState ← index;
4  stateWithCandidates ← {index};
5  cache ← {(index, CE)} such that CE = {e : e is a candidate element in the index};
6  stateFlowGraph ← ∅;
7  while stateWithCandidates ≠ ∅ do
8      Poll a state s from stateWithCandidates;
9      Reset the browser;
10     Move the browser from index to s;
11     Retrieve s.candidateElements from cache;
12     while s.candidateElements ≠ ∅ do
13         Poll an element e of s from s.candidateElements;
14         Fire a click event on e;
15         detectedState ← current state of the application in the browser;
16         if detectedState ≠ s then
17             Add e to crawlPath;
18             Compare detectedState with all other states in stateFlowGraph;
19             if detectedState ∉ stateFlowGraph then
20                 stateFlowGraph ← stateFlowGraph + {detectedState};
21                 Parse detectedState for candidate elements;
22                 stateWithCandidates ← stateWithCandidates + {detectedState};
23                 cache ← cache + {(detectedState, CanElements)} such that CanElements = {e : e is a candidate element in detectedState};
24             else
25                 crawlPaths ← crawlPaths + {crawlPath};
26             stateFlowGraph ← stateFlowGraph + {e};
27             currentState ← detectedState;

Figure 3.2: Crawljax high-level algorithm with more details

Chapter 4: Motivation and Research Goals

In order to realize a system for crawling Ajax-enabled Web applications, numerous layers of software tools and technologies, such as JavaScript execution engines and DOM processing libraries, must be put in place to obtain a purposeful crawling outcome. Ajax-enabled Web applications are crawled for different purposes, including "program comprehension" and "analysis and testing of dynamic Web states". For example, the model Crawljax infers can be used for "generating a static version of the application" [4].

Because of its inherent complexity, a purposeful crawling process, capable of producing a pragmatic model of Web applications, introduces manifold areas of optimization. These areas of optimization, once considered carefully, provide the opportunity to examine novel ideas for improving the crawling of Web applications.

In the following we explain the goals of the thesis by enumerating some challenges associated with crawling Web applications and proposing our ideas for attacking these problems. We put forward two high-level research questions to pinpoint a bird's-eye view of the goals of the thesis. Afterward, we elaborate on these questions by taking them to lower levels of abstraction to target more specific questions.

4.1 Motivation

Crawling Ajax-enabled Web applications is a predominantly resource-consuming task for the computer system hosting the crawler. This high degree of resource consumption rests, in large part, on two sets of drivers. First, Ajax crawlers perform various sub-tasks by utilizing a multi-layer stack of tools. Second, Web applications, which are the objects of the crawling, can be of considerable size in terms of the memory required to store their various components, such as their DOM or their GUI.

In crawling Web applications, tasks such as retrieving a Web application from its server, executing the application's scripts (chiefly JavaScript), communicating with a browser, tracking states and state transitions, and processing the content of the DOM impose considerable memory and time consumption on the computer system hosting the crawler.

In addition, providing a mechanism for handling dynamic state transitions in Web applications imposes a high level of resource consumption on crawling. Especially when applications have many states and those states are large, memory consumption becomes challenging. Table 4.1 presents the data we collected about the size of the DOM trees during our experiments. The data shows that most of the Web applications in the Alexa top list have states that are, on average, of considerable size. The average length of the states across all applications listed in Table 4.1 is around 232 thousand characters. Moreover, Crawljax stores a stripped version of the DOM beside the original DOM; this stripped version has less detail but is of almost the same size. In addition, the size of the Java char type is two bytes. Hence, the size of a state in memory is on average around 928 kilobytes (multiplying the average length of the DOM by four).

Table 4.1: Average DOM size in characters

Web Application    DOM Length
google.com         311671.49
wikipedia.org      242418.29
live.com           46303.59
twitter.com        192321.88
qq.com             603644.86
amazon.com         290661.37
linkedin.com       28188.43
baidu.com          184074.15
facebook.com       41485.17
youtube.com        310531.81
yhoo.com           304536.09
average            232348.83

In order to improve the crawling of Ajax-enabled Web applications, aiming at mitigating the memory and time consumption of the task of Web application crawling, we propose two enhancements to the crawling process. In the following, we state our enhancements targeting: 1) the improvement of the time performance of the state transition management, and 2) the memory performance of the state storage and retrieval component.

4.2 Research Questions

In this thesis, we initially defined two research questions to pinpoint the objectives of the thesis. These research questions address enhancing the performance of the crawler in crawling Ajax-enabled Web applications. In particular, we aim at conducting research to assess our ideas for improving the time consumption and memory utilization of the crawler in performing the crawling task. Our research questions address the following aspects of the crawling process:

Idea 1: Improving the time consumption of crawling by revising the process of examining state transitions in Ajax-enabled Web applications.

Idea 2: Optimizing memory utilization to improve the memory consumption of the crawling process.

The research questions are stated in the following and will later be broken down into more specific and more detailed sub-questions. Moreover, as we made progress in the thesis project and performed the experiments in detail, analysis of the results revealed interesting facts about the time requirements and memory utilization of our proposed methods. These discoveries guided us to design further research questions to assess additional enhancements and to explore further corners of the crawling process.

4.2.1 Improving Time Performance

As mentioned earlier, there are numerous aspects and sub-tasks in crawling Web applications that have the potential to be investigated for optimization opportunities. One important aspect of Ajax-enabled Web applications is that they can change their states dynamically; namely, they are dynamically stateful. More specifically, Ajax-enabled Web applications can transition into different dynamic states as the Ajax technologies (such as asynchronous calls and script execution) change the state of the Web application upon the occurrence of different events. Hence, managing the changes in the states and the transitions between the states is of high importance in the crawling process. As such, if we aim at modeling the states of Web applications during the crawling process, we have to provide some mechanism for keeping track of state changes in the Ajax-enabled Web application.

Keeping track of state changes and the transitions between states imposes significant complexity on the crawler and hence creates potential for optimization. One of the important steps in tracking the state changes is to determine whether, at specific points in time, the state of the Web application has changed or not. Another important issue is to check whether the state to which we have just transitioned is a replica (i.e. a clone of one of the previously stored states) or an unprecedented state. As such, we think the state comparison step can be attacked to reduce the crawling time.

Here we state our first idea for improving the crawling of Ajax-enabled Web applications as a high-level research question, which will later be broken into more specific sub-questions:

RQ1: How much can the time consumption of the state transition management be improved by enhancing the state comparison mechanism?

The intuition behind this research question is that we know, from experience, that comparing the states of Ajax-enabled Web applications is an expensive task. State comparison is time demanding because it includes several sub-tasks dealing with multiple layers of software tools. For example, accessing the states requires communicating with a browser. In addition, the states are, on average, large objects. This translates to an expensive time requirement for the extraction of the state data from the browser, the instantiation of the state objects in the crawler, and the comparison itself.

4.2.2 Increasing State Coverage

Memory utilization is another aspect of the crawling of Ajax-enabled Web applications that has significant implications for the performance of the crawler. From experience crawling Web applications with Crawljax in previous studies, we know that the states of Web applications are usually of considerable size. States of Ajax-enabled Web applications are stored in memory while a Web application is explored by the crawler. The large size of the states, multiplied by the number of states crawled up to any given point in the crawl, accounts for a sizable amount of memory consumption. We believe that this imposed consumption of main memory is one of the obstacles to increasing the coverage and comprehensiveness of crawling. As such, we believe that improving the memory consumption is a potential optimization area for achieving more coverage in crawling Ajax-enabled Web applications. We state our second idea for enhancing the crawling of Ajax-enabled Web applications as a research question which targets the memory utilization of the crawling process:

RQ2: To what degree can the coverage and comprehensiveness of crawling be improved by relinquishing state storage to a graph database?

What we endeavor to achieve here is to assess the effect of improving main memory consumption on the state coverage achieved in crawling. In particular, we are interested in improving the way state data is currently stored in the crawling process. We do this by analyzing the effects of enhancements made to the state storage mechanism on the memory performance of the crawler and on the number of states it can crawl.

4.3 Goals

The goal of this thesis is to improve the performance of the crawling of Web applications; in doing so, we assess our ideas for enhancing the crawling process. In particular, we target improving the time performance and memory performance of the crawling. In order to improve the time performance of the crawling, we aim at reducing the time the crawler spends on tracking state changes and the transitions between states. In addition, in order to improve the memory performance of crawling, we target the reduction of the per-state memory consumption of Web crawling so as to explore more states and increase the coverage of the crawling of Web applications.

4.3.1 Time Performance

In the process of crawling an Ajax-enabled Web application, as the application transitions to different states, the crawler needs to handle the transitions and to track and store the states and the navigational paths between them. In order to track the transitions, every time an event is fired on the Web application (line 14 of Algorithm 3.2), the crawler checks whether the Web application has transitioned to a new state or not (line 16). This task is done by comparing the states of the Web application before and after firing the event. We believe this practice results in some inefficiencies that create optimization opportunities.

In order to improve the time performance of the crawling, we aim at improving the mechanism the crawler uses for state comparison. To perform the state comparison, the crawler maintains the two states (before and after firing events, denoted by s and detectedState) in memory. These states are extracted from the browser by taking snapshots of the DOM tree of the application (line 15). Afterward, the crawler compares the two states in order to decide whether an alteration has happened to the state of the application. However, extracting the DOM from the browser, initiating a new state object, and comparing the objects after each event is very time consuming. Hence we suggest eliminating these steps whenever faster alternatives can be found.

If no event attribute (e.g. onclick) is attached to an HTML element, firing events (e.g. click) on the element will cause no change to the state of the Web application. If we have an alternative way of being notified that no changes have happened to the DOM, we can safely skip the whole comparison process and save the time that would otherwise have been spent.
However, the alternatives must be fast enough to overtake the currentcomparison mechanism.4.3.2 Memory PerformanceMemory consumption of the crawling process is significantly considerable.Expensive DOM related operations such as comparison, stripping and stringreplacements impose a sizable memory requirement on the crawler. In ad-dition, IO processes, mainly used for communications with the browser addsubstantial memory overhead to the crawling. Furthermore, during the ex-ploration of Web applications, the crawler builds an inferred state machinein the main memory. This huge state machine, storing the states discov-ered in the crawl session as well as the transitional links between the states,consumes almost half of the memory allocated to the crawler.In order to improve the memory performance, we focus on the mecha-nism currently employed for managing the state machine model. Storing,retrieving and assuring the uniqueness of states are the key tasks carried outin the crawling process and are directly related to the state machine model.As a Web application is crawled, states are discovered one by one and addedto the state machine. The memory is allocated gradually for each detectedstate. However, the main memory of the system is limited. Hence, at somepoint, when a specific number of states are stored in the state machine, thereis no capacity left in the main memory for allocating space to crawling tasks(including adding a new state). As a result, the crawling process halts (onerrors), and having covered a specific number of states the crawler quites onan out of memory error.In order to improve the state space coverage in crawling Web applica-tions, we have to dismantle the memory limitation. Considering the hugesize of the stateflow graph, we target freeing the memory required for stor-234.4. Methodologyage and retrieval of the states and transitions between the states. We aimat freeing the memory so it can be allocated to the rest of the tasks in thecrawling process. If we are able to do so, we could continue the crawlingfurther and discover more states and subsequently cover a greater numberof states than the original crawler does.4.4 MethodologyIn this thesis we investigate if application of a number of enhancements to thecrawling process can result in time and memory performance optimization incrawling Web applications. Specifically, tracking the changes in DOM treesof Web applications is enhanced in three different methods as an endeavorto optimize the time spent in states comparisons. In addition we investigatehow utilizing a graph database for storage and retrieval of the state machinemodel affects memory performance of the crawling process.Crawljax has various extension points, called plug-ins, which can beutilized for performing additional tasks at specific points in the crawlingprocess. We also add new extension points to the Crawljax. Availing ofthe current and new extension points we apply our alternative methodsfor tracking DOM-tree changes in the applications. In order to expedite thestate comparison process we relinquish the change tracking to agents that wedeveloped to work on the browser side instead of the crawling engine?s side.This way we install an agent in the browser side and aim at bypassing theexpensive DOM extraction, initiation and comparison operations. 
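To make this idea concrete, the sketch below shows, in simplified Java, how a crawl loop could consult a browser-side flag before paying for DOM extraction and comparison. This is an illustrative sketch rather than Crawljax's actual code: the flag name window.__domChanged, the class name, and the method name are assumptions; only Selenium's standard JavascriptExecutor interface is relied upon.

    import org.openqa.selenium.JavascriptExecutor;
    import org.openqa.selenium.WebDriver;

    // Illustrative sketch (not Crawljax's implementation): read a mutation flag
    // maintained by the injected browser-side agent, so that DOM extraction and
    // comparison can be skipped when nothing has changed.
    public class MutationFlagChecker {

        private final WebDriver driver;

        public MutationFlagChecker(WebDriver driver) {
            this.driver = driver;
        }

        // Reads and resets the flag; returns true only if the agent observed a change.
        public boolean domChangedSinceLastEvent() {
            Object result = ((JavascriptExecutor) driver).executeScript(
                    "var changed = window.__domChanged === true;"
                  + "window.__domChanged = false;"
                  + "return changed;");
            return Boolean.TRUE.equals(result);
        }
    }

In the crawl loop, the expensive path (retrieving the DOM string from the browser, constructing a state object, and comparing it against the previous state) would then only be taken when domChangedSinceLastEvent() returns true.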
Proxies(explained in next chapter) and Crawljax plug-ins are used to install thestate tracker agent.There is a sizable graph data structure in the crawling process, calledstateflow graph, which holds the main output of the crawling process. Aim-ing at making memory utilization more efficient, we introduce a new compo-nent to the crawler for replacing this data structure and handling the storageand retrieval of the states transitions model. The component that we intro-duce interfaces a graph database to store the stateflow graph structure anddata in a database rather than main memory. This method is expected toreduce the memory consumption, however it might impose an IO overheadaffecting both time and memory performances. It is the experimental resultsthat will determine to what extant this method is effective in boosting theperformance of the crawling.In order to assess the effects of the proposed enhancements on the timeand memory performances of the crawling process, we conduct experiments244.4. Methodologyon ten Web applications from Alexa?s top Websites list. As these Web sitesare complex Web applications composed of manifold components, they arenot deterministically crawlable. That is, crawling the same Web applicationmultiple times does not result in the same sequence of states each time. Assuch, in order to mitigate the effects of this nondeterministic behavior, wecrawl each applications multiple times and take average over the collectedresults.In the next two chapters we explain in details how we applied the two en-hancements to the crawling process. In addition, the way experiments wereconducted is discussed and the results of the experiments are also presentedat each chapter. The next chapter covers the improvements we proposed forstate transition management and the chapter after that discusses improvingthe memory consumption of the crawling process.25Chapter 5Optimizing State TransitionManagementThe crawler models the Web application being crawled by inferring a statemachine representing the states of the application and the navigational pathsbetween the states. Therefore, keeping track of the client-side states and thetransitions between the states is a substantial task in the crawling process.In order to manage the state transitions, the crawler communicates with thebrowser in which the application is running. These communications includetaking a snapshot of the state of the application at certain occasions to test ifany changes have been made to the state of the Web application (lines 15 and16 of algirithm 3.2). As these sub-tasks are expensive operations in terms ofthe time they consume, we investigate methods of optimizing the transitionmanagement aspects of the crawling process. Throughout this chapter theconcept of the ?state? of a Web application comes up frequently. Hencewe shed some light on what we refer to as the state early in the chapter toimprove the clarity.Web Applications States The state of a Web application at any givenpoint of time is dependent on values of multiple components of the Web ap-plication. This multivariate concept includes both server-side and browserside variables. The browser side state itself, cab be delineated to the valuesof the DOM-tree elements, the values of the JavaScript variables, JavaScriptfunctions and some less frequent elements such as Web Storage [13]. How-ever, as the crawler focuses on the DOM-tree of the Web application, in thisthesis, what we refer to as the state of the Web application is as follows:? 
In the browser the state of the applications is the DOM Object?s state.? In the crawler, the state is the string representation of the DOM-treeafter being processed and normalized (stripped).The crawler representation of the state has less details. For examplesome white spaces such as tabs and carriage returns are removed from the265.1. Tracking State Changes in Web Applicationsoriginal DOM tree. Hence, chances are that minimal changes to the DOM-tree are filtered by the crawler. The reverse cannot be true as every changein the crawler side must be originated from some change in the browserDOM-tree.5.1 Tracking State Changes in Web ApplicationsCurrently what Crawljax does to track the changes applied to the states ofWeb applications is as follows. After firing an event on some specific candi-date element in a page of the application, the crawler retrieves the currentstate of the application from the browser. For example, in photo gallery ap-plication, Crawljax clicks on one of the photo albums. As a result the GUIof the photo applications changes so that a number of photo thumbnails areloaded into the view. Crawljax retrieves the DOM of this new view of theapplication.Afterward the crawler constructs suitable data holders for shaping theraw data retrieved from the browser into usable objects representing thethe state of the Web application. Then it compares this newly retrievedstate (photo thumbnails DOM) with the previous state (the home page) soas to test if changes have been made to the application. Performing thesesub-tasks results in expensive inter-process communications (e.g. betweenthe browser and Crawljax), string operations, object construction and anumber of sizable IO operations for sending the state of the applicationfrom browser to the crawler. Here we investigate alternative approaches fortracking the state of the application aiming at eliminating or bypassing someof these operations, whenever possible, to improve the time efficiency of thecrawling.Investigating the way currently the crawler performs the transition checks,we came up with the idea of eliminating unnecessary comparisons. In par-ticular, if no event attribute is attached to an element, firing events on theelements causes no change in the state of the Web application. Hence, inthese situations, if we can be notified that there is no change in the appli-cation since last time, we can avoid:1. Retrieving a large amount of data representing the state of the Webapplication from the browser2. Processing the raw data representing the state of the application toinitiate the objects representing the state275.2. Foundations for the Proposed Techniques3. Performing a comparison on two usually large objects representing theprevious and newly retrieved statesThe key point here is that we do not aim at eliminating these stepscompletely; We only want to find alternative methods that are able to bypassthe unnecessary comparisons. To that goal, we need to find an effective wayto get notified of changes as they are made to the state of the application.5.2 Foundations for the Proposed TechniquesWe employed a number of different technologies and techniques in our pro-posed methods for improving the management of states transitions. Webriefly touch upon each concept and different options that we had and enu-merate their points of strength and disadvantages. 
We make an introduction to the different options available for tracking DOM mutations and to the proxy technology that we utilized in this project.

5.2.1 Tracking DOM Mutations

DOM mutations are any changes made to the DOM tree. The mutations include addition, deletion and modification of the content of the elements of the DOM and also structural changes to sub-trees of the DOM. Changing the name of an attribute, deleting a node in the DOM and modifying the textual content of a node are examples of DOM mutations.

Crawljax, the Web application crawler whose performance improvement is the subject of this thesis, crawls Web applications and builds a stateflow graph of the transitions between different states in a Web application. For the purpose of its crawling, Crawljax considers the current state of the DOM tree as the state of the Web application under crawl at any given point of time. As such, tracking the changes made to the DOM, namely DOM mutations, is of high importance in identifying state transitions during crawling sessions.

As Web technologies and standards have evolved over the past decade, different mechanisms and APIs have become available to Web developers for tracking DOM mutations. Mutation Events, Mutation Observers, and finally the mutation-summary library are the main options for tracking DOM mutations in developing Web applications. We opt for utilizing the mutation-summary library in this thesis. We justify this choice of technology by explaining each option in the next sections.

5.2.2 Mutation Events

The Mutation Events interface [14], which was introduced in DOM Level 2, is a mechanism for tracking the changes made to the DOM. As the design of the Mutation Events interface was considered to be flawed, it is in the process of being dropped and is considered deprecated. There are a number of performance and bug-causing issues associated with this interface: the mechanism is slow; it triggers too many reports, synchronously, upon the occurrence of changes to the DOM; and it leads to the development of crash-prone Web applications [15].

5.2.3 Mutation Observers

Mutation Observers provide another mechanism for receiving notifications about changes made to the DOM tree. Mutation Observers are introduced in DOM3 [16] and are designed to replace Mutation Events for tracking alterations of the DOM. Mutation Observers provide callback functions to handle the changes. This has the advantage of handling multiple changes as a group, whereas with Mutation Events a separate event is triggered for every single alteration.

5.2.4 Mutation-Summary Library

The Mutation-Summary library [17] is an open-source library implemented in JavaScript and provided by Google®. Being implemented on top of the Mutation Observers API, the mutation-summary library provides a more reliable way for tracking the changes made to the DOM. The improvements of this library over the Mutation Observer API include the ease of using its API, and grouping and filtering DOM alterations before reporting.

The mutation-summary library works in a way that it takes snapshots of the DOM at specific intervals and then compares these snapshots in order to find and report the differences between two subsequent snapshots. This means that transient changes that dissipate shortly, in such a short time that they do not last till the next snapshot, are not reported.
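To illustrate what such a browser-side tracker can look like, the sketch below builds the agent directly on the standard Mutation Observer API that the mutation-summary library wraps; it simply raises a flag on any observed change. The script is kept as a Java string constant so that it can be installed through Selenium's JavascriptExecutor. The flag name, the guard against double installation, and the class name are illustrative assumptions, not the exact implementation used in the thesis.

    import org.openqa.selenium.JavascriptExecutor;
    import org.openqa.selenium.WebDriver;

    // Illustrative agent installer; the injected script is a sketch based on the
    // standard MutationObserver API rather than the mutation-summary library.
    public class MutationAgentInstaller {

        private static final String AGENT_SCRIPT =
              "if (!window.__mutationAgentInstalled) {"
            + "  window.__mutationAgentInstalled = true;"
            + "  window.__domChanged = false;"
            + "  new MutationObserver(function() {"
            + "    window.__domChanged = true;"      // any mutation marks the state as dirty
            + "  }).observe(document, {"
            + "    childList: true, subtree: true,"
            + "    attributes: true, characterData: true"
            + "  });"
            + "}";

        public void install(WebDriver driver) {
            ((JavascriptExecutor) driver).executeScript(AGENT_SCRIPT);
        }
    }

A small installer of this kind can then be invoked by whichever injection mechanism is chosen; the alternatives discussed in the following sections differ mainly in where and how this installation happens, through a proxy or through crawler plug-ins.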
As thestate comparison intervals in Crawljax are significantly longer than thoseof Mutation Observers and mutation-summary Library, we are interestedonly in more durable changes that last for a considerable period of time.Hence, the summarizing feature in this technology is very amenable to ouruse-case. However, in use-cases where transient DOM state changes are ofimportance, this library should not be used [18].295.3. Alternative Methods for Tracking State Changes5.2.5 ProxiesA proxy is a concept in computer networking which allows implementingencapsulation in the design of network architectures [19]. A proxy server,is a server intercepting the communications between a client and its server.Intermediating between the client and the server , proxies can access andmodify the request from client before sending it to the server. Likewise,proxies can also access and make alterations to the response from the serverbefore handing it to the client. Proxies are of different types and are usedfor various purposes.Different types of proxies include but not limited to Open Proxies, For-ward Proxies and Reverse Proxies. Providing anonymity on the Internet,filtering, caching and eavesdropping are examples of using proxies. Prox-ies can be implemented in different layers of computer networking. Proxyservers are in the form of software application or a physical device such asa router. In this thesis we set up a software proxy intercepting HTTP andHTTPS communication in the application layer.Having explained the foundations upon which our proposed methods aredrawn, we continue the discussion by presenting the alternative methods wecame up with for improving the time consumption in the crawling.5.3 Alternative Methods for Tracking StateChangesWe aim at bypassing unnecessary comparisons. Unnecessary comparisonshappen when an event is fired on some element but it does not lead to anychange in the DOM-tree, even a single byte. The current crawler, however,performs the three steps mentioned above. We propose the idea of installinga light weight agent in the browser side, instead of in the crawler side, totrack the changes in the DOM-tree. This agent should have the followingcharacteristics:? It must be able to track every single change applied to the DOM.? It should not impose considerable time overhead on the browser andas a result on the crawling process.? It must provide some interface that can be utilized for communicatingfrom the crawler305.3. Alternative Methods for Tracking State ChangesWe came up with three different solutions to build our agent with theaforementioned desired characteristics. However, it is the experimental re-sults that show how efficient the solutions are. Our solutions have threecomponents as it is shown in the figure 5.1. The injector component isresponsible for installing the agent in the browser side. The agent itselfis responsible for tracking DOM mutations and setting the mutation flagwhenever some changes are made to the DOM. Finally a communicationcomponent that retrieving the mutation flag from the agent in the browseris its main responsibility.Generic Solution ComponentsMutation Tracker AgentMutation-FlagRetrieverAgentInjectorFigure 5.1: Generic solution componentsWe present the common approaches and technologies utilized in all al-ternative methods we proposed in the next section. 
Following the common-alities the differences are presented and each method is explained in details.5.3.1 Shared Characteristics of the SolutionsRegarding the DOM change tracking part we utilized the Mutation-Summarylibrary to get notified of every single change made to the DOM. The com-munication component of the agent was made feasible by introducing newextension points to the crawler. Finally, in order to install the agent, wemade use of three different methods utilizing multiple technologies.The Mutation-Summary library is very flexible in allowing us to definewhat type of changes are of importance to us. As every single change in theDOM-tree has the ability to be translated into a crawler-side state change,we set the mutation tracker capabilities to track all types of change, in-cluding textual and structural changes in forms of additions, deletions and315.3. Alternative Methods for Tracking State Changessubstitutions. In addition we designed a flag for storing the mutation statusof the DOM-tree. This mutation flag is originally reset to false and upon ev-ery single change made to the DOM it is set to true. True means that somechanges have been made to the DOM-tree and hence a full state comparisonis required.A new extension point was introduced to the crawler to enable commu-nications between our agent and the crawler. As it is shown in algorithm3.2, the Crawljax algorithm fires an event on a candidate element, waits fora certain amount of time and then retrieves the DOM-tree from the browser.However, we made it possible to add additional functionalities at the veryexact point before retrieving the DOM-tree from browser. We utilized thisextension point to build the communication part of our agent. What we doat this point is that instead of retrieving a huge DOM-tree from the browser,we retrieve our mutation flag; If the flag is set to true we proceed with theordinary full comparison process, otherwise, assured that the state of theDOM has not been changed we bypass the unnecessary DOM-tree retrievaland comparison steps.Now we move on with explaining each alternative solutions in detailsand present the way they differ and how they are improved incrementallyover one another.5.3.2 Proxy-Based SolutionIn the first solution we utilized a proxy to install our agent in the browserside. The agent is installed in the browser-side by intercepting the infor-mation that is sent by the Web server of the application being crawled.The information sent by the Web server is processed and whenever a newpage is sent to the browser the agent is appended to the page so as totrack the alterations of the DOM-tree. We set up a Web server to host theMutation-Summary library so it can be included as part of our agent. TheMutation-Summary is hooked to the application as an external JavaScript.The communication with the agent is performed utilizing the newly cre-ated extension point explained earlier. The components of the proxy-basedsolution are outlined in figure 5.2.The proxy helps us achieve full control over data exchanged between thebrowser and the server and install our agent smoothly. However, it imposesa considerable latency on processing the information and as a result the timeit saves is less than its overhead. Hence we continued improving the agentin the tow versions that are described in the following sections.325.3. 
Figure 5.2: Proxy-based solution components

5.3.3 Plugin-Based Solution

As depicted in figure 5.3, in this version we retained the Web server and the communication point, but we eliminated the proxy component to achieve time efficiency. As such, the role of the proxy, which was installing the agent in the browser side, is carried out by building a new plug-in for the crawler. This plug-in is attached to the crawler at an extension point called the On-Page-Load plug-in. Hence, the plug-in installs the agent every time a new page is loaded in the browser. In addition, whenever the communication component attempts to retrieve the mutation flag, it checks whether the agent is present in the browser. If the agent is not present, it injects it again so it is installed and ready to function for the next iteration. This version of the solution works significantly more efficiently than the proxy-based solution and outperforms the default behavior of the crawler. We, however, aspire to further improve it, as explained in the second plugin-based method.

Figure 5.3: Plugin-based solution components (external injection)

5.3.4 Plugin-Based Solution with In-Memory Agent

The performance achieved by the plugin-based solution realized our goal of improving the time consumption of the transition management. However, we had an intuition that placing the mutation-summary library on a Web server and fetching it repeatedly imposes a time overhead on the solution. Hence, we designed another solution in which we eliminated the Web server, as shown in figure 5.4. In this solution, the installer component holds a copy of the library and injects it directly into the Web page at the same time as the core of the agent is injected. The difference here is that in the previous version, a reference to the mutation-summary library was injected into the page and the browser had to load the actual library itself, but in the new version the actual library is injected directly. The other components are the same as in the plugin-based solution.

Figure 5.4: Plugin-based solution components (internal injection)

Having covered the three enhancements proposed for improving the time consumption of the crawling, we present the experiments conducted to evaluate the three solutions in the next section.

5.4 Evaluation

We conducted a series of experiments to answer the research questions that shaped the objectives of this thesis. In this section we break down the first high level research question into three specific questions. Afterward, we present the design of our experiments, the methodology of conducting them and the results achieved from carrying out these experiments.

5.4.1 Research Questions Broken Down

Our first high level research question targets improving the time consumption of the crawling process.
Having explained more details about our en-hancements to the state transition management, we delineate RQ1 into morespecific research questions to investigate the performance of our solutions.We have three research questions associated with improving the time perfor-mance of the crawling process. As explained in details earlier, these solutionsare proxy-based, plug-in based with Web server and plug-in based with in-memory injection. Each research question addresses the time performanceof one of these solutions. Hence we state the following sub-questions:RQ1.a : To what extent does the proxy-based solution for managing statetransitions improve the time performance of the currently employedmechanism for managing state transitions?RQ1.b : To what extent does the Plug-in-based solution for managing statetransitions improve the time performance of the currently employedmechanism for managing state transitions?RQ1.c : To what extent does the Plug-in-based Solution with in-memoryagent for managing state transitions improve the time performance ofthe currently employed mechanism for managing state transitions?We will investigate these question and explain the experiments conductedto answer these questions in details in the following sections. We shed lightson how we investigated the effects of replacing the default behavior of thecrawler by our alternative solutions.5.4.2 Experimental Design and MethodologyWe designed a serious of experiments to compare the enhanced versions withthe default version of the crawler to answer our research questions. In eachexperiment we evaluate the enhancements made to the crawler against the355.4. Evaluationdefault behavior of the crawler. In each enhancement proposal additionalfunctionality has been introduced to the crawling algorithm. These enhance-ments help track the mutations of DOM trees to bypass unnecessary com-parisons. On the other hand, these enhancements can potentially introducetime overheads to the system. Hence we devised a number of experimentsto measure the variables pertaining to the efficiency and correctness of thevarious solutions in hand.In these experiments the main variable measured is the time consumptionof default vs. enhanced crawling. However, we measure additional variablesto achieve more insights over the performance and the correctness of the so-lutions. First, we measure the time overhead of the mutation tracker agenti.e. the proposed solution. In addition, we measure the time spent on com-paring identical DOM trees whose comparison is unnecessary. We performedmore measurements whose variables are explained in the following.First and foremost, we measure the time consumption each alternativesolution imposes as an extra overhead on the system. This time is dividedinto two categories. First, there is an overhead associated with installingthe agent in the application. This is the time required to intercept andmodify the responses sent from the server for the proxy-based solution andplug-in injection time for the two other solutions. The second category isthe time each solution needs to retrieve the mutation flag. Furthermore,we measure the time spent on performing unnecessary comparisons whichmust be bypassed. These unnecessary comparisons are bypassed by theenhanced solutions. Another variable that we measure is false negatives. Afalse negative happens when the mutation flag shows no change in the stateof the DOM but actual DOM comparison informs us of mutations in thestate of the application. 
The total time of each experiment, the size of the DOM-trees, the number of events fired, the number of bypassed comparisons, and the total number of states crawled are additional variables that we measure in the experiments.

Experimentation Objects  In these experiments we use the top most-visited Web sites from the Alexa list. The Web applications crawled as the objects of the experiments are listed in table 5.1 along with the IDs assigned to them. These IDs are used to save the limited space in the tables and charts illustrating the experiments' results.

Table 5.1: State transition management experiments objects

  ID    Web Application
  ggle  www.google.com
  wkpd  www.wikipedia.com
  live  www.live.com
  twtr  www.twitter.com
  qq    www.qq.com
  amzn  www.amazon.com
  lnkd  www.linkedin.com
  tbao  www.taobao.com
  yndx  www.yandex.ru
  yhoo  www.yahoo.com

The dynamics of Web application responses require careful attention in designing the experiments. One important issue that needs careful attention is that when performing an experiment on the client side of a Web application, the server side characteristics should be taken into consideration. It is quite possible that inputting the same sequence of events to the GUI of a Web application results in two different responses from the server. In order to mitigate the effects of the non-determinism induced by the server side and the network characteristics, we repeat each experiment five times for each Web application.

We measured the time consumption by means of the System class time measurement capability. The maximum number of states for each experiment was initially set to 30. However, after conducting the experiments we observed that if some applications are allowed to run for a maximum of 24 hours, some of them could use up the default heap space and result in a heap space error. Hence, we changed the maximum heap size and repeated the experiments with the maximum number of states set to 10. The results of performing the experiments, along with precise definitions of the variables measured, are presented in the next section.

5.4.3 Results

Each solution is evaluated by crawling all the object Web applications. For each experiment, one Web application from the objects is crawled by the crawler. The time saved by the solution and the time overhead imposed on the crawler by the solution are measured in each experiment. The experiments are run five times and the average (arithmetic mean) is taken over the results. Results for the proxy-based, plugin-based with Web server, and plugin-based solutions are presented in figures 5.5, 5.6, and 5.7 respectively.

Designing the experiments, we strove to measure all the meaningful metrics that are of importance to the evaluation of our proposed methods. Here we explain these variables before presenting the results in the tables and charts.

Saved Time : The time improvement achieved by introducing the new solution. That is the time spent on unnecessary comparisons (i.e. the time our solutions save). This includes the time that is spent on communicating with the browser and retrieving the state of the application. The additional overheads such as constructing objects for states are also included in this variable. This is the main variable that proposed enhancements must maximize.

Overhead : The time overhead imposed by the new solution on the crawler. This is the main variable that proposed solutions must minimize. Overhead time is the sum of flag time and proxy time for the proxy-based solution.
This is the sume of flag time and plug-in time for theplugin based solutions. These are defined next.Proxy : The time consumed by the proxy. In the first solution which proxyplays a significant role we measure the time overhead of interceptingcommunicants for injecting our agent.Plugin : The time consumed by the ?plug-in? based injections. In the twoplug-in-based solutions we measure the time overhead of injecting ouragent.Flag : The ?time? consumed for retrieving the mutation ?flag?. This in-cludes the time required for communication with the browser, exe-cuting javaScript in the application and retrieving the value of themutation flag.Time : The total time of an experiment. That is the time beginning fromstart of a crawl session until either reaching the maximum number ofstates or quiting the experiment for other reasons such as exhaustingthe state space of the application under crawl.Events Fired : Total number of ?events fired? on a Web application bythe crawler. The crawler fires events such as the click event on specificelements in the applications under crawl.385.4. EvaluationBypassed : The number of comparisons that are bypassed because theagent notified the crawler that no change has been made to the stateof the Web application.States : The number of states crawled in the experiment.FN : The number of false negatives. False negative in this contexts are caseswhere the mutation flag suggests bypassing a comparison because nochange has been reported by the agent, but the actual comparisonreveals that changes have been made to the state of the applicationand have been recognized by the agent.FP : The number of false positives. In this context a false positive happenswhen existence of changes are reported by the agent but the actualcomparison decides the change is not significant enough to be consid-ered as a state change.Having covered the variables and acronyms that are to be used in the resultswe present the results achieved from conducting the experiments in thefollowing section.RQ1.A: Time Performance of the Proxy-Based Solution The proxy-based solution was thoroughly assessed by crawling the experiment objectsmultiple times. Figure 5.5 shows the main variables that were measured inthe experiments:Improvement: The time the proxy-based solution saves by eliminatingunnecessary comparisons and communications.Overhead: The time overhead imposed by the proxy-bases solution.As it is presented in figure 5.5, proxy-bases solution time overhead is morethan the time it saves by improving the crawling process. The new solutiondoes not overtake the default behavior significantly. The new solutions per-forms similar to the default version for four of the the ten cases. The defaultcrawler performs significantly better on six objects out of ten.The proxy-based solution does not improve the time performance of thestate transition management process.Table 5.2 presents the the experiments results in more details. For eachWeb application the time saved by bypassing unnecessary comparisons are395.4. Evaluationggle wkpd live twtr qq amzn lnkd yndx tbao yhoo averageProxy?based Time Improvements vs. Time Overheads Web ApplicationsTime (s)0100200300400500 Saved TimeOverhead TimeFigure 5.5: Proxy-based solution time performance: the time overhead im-posed by the proxy-based solution is much greater than the time it saves bybypassing unnecessary comparisons. Hence, the default crawling performsmore efficiently. 
The bars represent the averages taken over 5 rounds ofexperiments presented in table 5.2.listed. In addition, the time overheads, one for retrieving the flag and an-other one imposed by the proxy, are shown separately. These two itemsare also summed up together as the total overhead and are listed under the?Time Overhead? title. The total time of the experiments are also includedin table 5.2. The values shown in table 5.2 are averages taken over fiverounds of experiments. Further statistical measures about the results arealso provided; standard deviation in table 5.3, absolute value of coefficientof variation (i.e. relative standard deviation) in table 5.4, maximum valuesis table 5.5, and minimum values in table 5.6.Table 5.7 presents the detailed data about the number of events fired onexperiment objects and how many of these events resulted in new states. Italso includes how many times the agent bypassed unnecessary comparisons.The column, ?Events Fired? shows the total number of events fired on theapplication. In the default crawler, each fired event results in a comparisonto check whether the state of the application is changed or not. However, theproposed solution bypasses a number of these comparisons. These measurednumbers of bypassed comparisons are listed in the ?Bypassed? column. Thetotal number of states crawled in the experiments are included in the ?State?405.4. EvaluationTable 5.2: Proxy-based time consumption (milliseconds); the proxy-based?time overhead? is much greater than the time improvements i.e. ?savedtime?.ID Saved Time Proxy Time Flag Time Time Overhead Total Timeggle 138.2 8931.6 130 9061.6 57113.4wkpd 18601.4 83642.8 22008.2 105651 243323.4live 6851.4 1606.6 739.4 2346 97208.2twtr 4359.8 1558.8 496.8 2055.6 202418.8qq 0 434564.8 97.4 434662.2 951066.2amzn 47224.2 97183.8 914.8 98098.6 341746.6lnkd 6087.8 30472.2 949.6 31421.8 114203yndx 0 58768.8 91 58859.8 67407.2tbao 132.6 332581.8 97.2 332679 616755.6yhoo 327 5152.2 46.6 5198.8 26142.6average 8372.24 105446.34 2557.1 108003.44 271738.5column. False negative and false positives are also presented in ?FN? and?FP? columns receptively.RQ1.B: Time Performance of the First Plugin-Based Solution Thefirst pulgin-based solution, which utilizes a Web server for hosting the mutation-summery library is assessed for its time performance. We present the resultsof the experiments in figure 5.6. We performed the experiments on the ob-ject Web applications to compare the time performance of the proposedplugin-based solution against the default crawling process. In these exper-iments multiple variables are measured. However, the main two variablesthat are present in figure 5.6are as following:Improvement: The time the solution saves by eliminating unnecessarycomparisons and communication steps when there is no change in thestate of the Web application after firing an event.Overhead: The time overhead imposed by the plugin-based solution.As it is shown in figure 5.6 the sever-utilizing plugin-based proposedsolution performs more efficiency in the carrying out the state transition415.4. 
EvaluationTable 5.3: Standard deviations of proxy-based time consumptionID Saved Time Proxy Time Flag Time Time Overheadggle 9.44 973.08 5.24 977.29wkpd 2470.24 6258.46 2855.96 8666.19live 1039.96 157.61 82.49 205.57twtr 406.34 87.45 50.11 76.06qq 0.00 22790.28 6.27 22786.71amzn 4365.87 14216.21 104.32 14300.70lnkd 891.01 3187.10 137.90 3298.76yndx 0.00 4028.57 12.88 4036.75tbao 12.30 27018.00 9.26 27025.11yhoo 36.93 290.98 3.71 290.22management tasks. As it is depicted in the bar chart in figure 5.6, on averagethe proposed solution consumes less time than the default crawling process.In addition, in half of the cases the proposed solution consumes considerablyless time than the time the default crawler spends on performing unnecessarycomparisons. It is only in one case that the proposed solution is more of anoverhead than an improvement, and in four cases out of ten the differenceis not very significant.The plugin-based solution, utilizing a web server, improves the time perfor-mance of the state transition management process by 253.34%. That meansthe time consumption ratio of default to proposed is 3.5334, which meansthe time consumption is reduced by 71.69%.Aside from figure 5.6 which shows the gist of the results of the exper-iments, we present the details of the data gathered in the experiments onthe the sever-utilizing plugin-based proposed solution in table 5.8. As it isshown in table 5.8 time improvements of the solution in the experimentsare listed by Web application. The improvements are listed in the ?SavedTime? column, and the time overhead is provided as a total number underthe ?Time Overhead? title. The time overhead is also broken down in to itsconstituents, plugin time and mutation flag retrieval time in columns ?Plu-gin Time? and ?Flag Time? respectively. The values presented in table 5.8lists the averages taken over results of five rounds of experiments. Further425.4. EvaluationTable 5.4: Relative standard deviations percentages (absolute values ofcoefficient of variation) of proxy-based time consumptionID Saved Time Proxy Time Flag Time Time Overheadggle 6.83 10.89 4.03 10.78wkpd 13.27 7.48 12.97 8.20live 15.17 9.81 11.15 8.76twtr 9.32 5.60 10.08 3.70qq Not defined 5.24 6.43 5.24amzn 9.24 14.62 11.40 14.57lnkd 14.63 10.45 14.52 10.49yndx Not defined 6.85 14.15 6.85tbao 9.27 8.12 9.52 8.12yhoo 11.29 5.64 7.97 5.58statistical measures about the results are also provided; standard deviationin table 5.9, absolute value of coefficient of variation (i.e. relative standarddeviation) in table 5.10, maximum values is table 5.11, and minimum valuesin table 5.12.Table 5.13 sheds more lights on the details of how the aforementionedtime performance improvements are achieved by the sever-utilizing plugin-based proposed solution. The variables presented in this table have the samemeaning as those presented in table 5.7.RQ1.C: Time Performance of the Second Plugin-Based SolutionThe second plugin-base solution, which does not utilize a Web server forhosting the mutation-summary library and instead injects it directly, wasalso assessed thoroughly in the same manner as its two predecessors. Wepresent the gist of the experiments carried out on the ten object Web appli-cations in figure 5.7. As it is shown in the bar chart, the second plugin-basedsolution performs more efficiently than the default crawling process. As itis shown in the average bars, the proposed solution consumes less time thanthe default crawler on average. 
Moreover, in six cases out of ten, the pro-posed solution takes over the default crawler. It is only in one case that theproposed version consumes significantly more time than the default version.435.4. EvaluationTable 5.5: Maximum values of proxy-based time consumption (millisec-onds)ID Saved Time Proxy Time Flag Time Time Overheadggle 152 10273 137 10406wkpd 20647 9 7 24209 115775live 7468 1783 820 2603twtr 4664 1683 568 2140qq 0 469333 103 469427amzn 51946 107874 6 108880lnkd 6757 32909 1054 33953yndx 0 63470 101 63569tbao 147 369165 107 369267yhoo 377 5615 52 5661The plugin-based solution, not utilizing a Web sever, improves the process ofstate transition management by 197.48%. That means the time consumptionratio of default to proposed is 2.9748, which means the time consumption isreduced by 66.38%.In addition to figure 5.7 which presents the essence of the experimentsresults, we shed more lights on the details of the time consumption by thesecond plugin-based solution in 5.14. The variables presented in this tablehave the same meaning as those in table 5.8. Here again, the time overheadis presented in two ways. First, it is broken down to its two components,the plugin time and the flag time. In addition, it is presented as the sumof its two constituents in the column ?Time Overhead?. The time improve-ments of the proposed solutions is listed in the column ?Saved Time?. Theaverages taken over the results of five rounds of experiments are listed intable 5.8. Further statistical measures about the results are also provided;standard deviation in table 5.15, absolute value of coefficient of variation(i.e. relative standard deviation) in table 5.16, maximum values is table5.17, and minimum values in table 5.18.Table 5.19 presents the data achieved from experiments which illustratesin details the manner in which the second plugin-based solution bypassesunnecessary comparisons and improves the time efficiency of the crawlingprocess. The variables presented in this table are the same as those presentedin tables 5.13 and 5.7.445.4. EvaluationTable 5.6: Minimum values of proxy-based time consumption (milliseconds)ID Saved Time Proxy Time Flag Time Time Overheadggle 129 8038 124 8164wkpd 14511 75281 17168 92449live 4 1445 609 2105twtr 3665 1449 442 1965qq 0 408490 88 408591amzn 40142 72889 770 73659lnkd 4569 24990 742 25732yndx 0 55830 69 55901tbao 118 299326 85 299411yhoo 294 4894 42 4942As presented in the tables and figures so far, in summary, the proxy-basedsolution does not improve the time performance of the state transition man-agement process. The first plugin-based solution, utilizing a web server,improves the time performance significantly. On average the first plugin-based solution improves the time performance by 253.34%. That meansthe time consumption ratio of default to proposed is 3.5334, which meansthe time consumption is reduced by 71.69%. The third solution, which isplugin-based and does not utilize a Web sever, improves the process of statetransition management by 197.48%. That means the time consumption ra-tio of default to proposed is 2.9748, which means the time consumption isreduced by 66.38%.We will elaborate more on the results when revisiting research questionsin the discussion chapter. Having addressed state transition managementimprovements we continue with presenting the improvements made to thescalability of the crawling in the next chapter.455.4. 
EvaluationTable 5.7: Proxy-based bypassed comparisons: the detailed information onevents fired, comparisons that were categorized as unnecessary, false nega-tives and false positives.ID Events Fired Bypassed States FN FPggle 10 1 10 0 0wkpd 169.8 154.2 10 4.2 6.6live 72.6 62.4 10 8.8 1.2twtr 50.8 41 10 8.6 0.8qq 9 0 10 0 0amzn 92.8 83.2 10 0 0.6lnkd 100 90.4 10 0 0.6yndx 9 0 10 0 0tbao 10 1 10 0 0yhoo 17.8 7.6 10 0 1.2ggle wkpd live twtr qq amzn lnkd yndx tbao yhoo averageExternal?JS?Plugin?based Time Improvements vs. OverheadsWeb ApplicationsTime (s)051015202530 ImprovementOverheadFigure 5.6: External js-plugin-based solution times: the first plugin-basedsolution, which employs a Web server and injects scripts externally, overtakesthe default process. The time improvements on average is much greater thanthe overheads imposed. The bars represent the averages taken over 5 roundsof experiments presented in table 5.8.465.4. EvaluationTable 5.8: External js-plugin-based solution times (milliseconds): the timeimprovements by the first plugin-based solution is much greater than thetime overheads it imposes on the crawling process.ID Saved Time Plugin TIme Flag Time Time Overhead Total Timeggle 0 136.8 190.8 327.6 57113.8wkpd 12846 144.6 22757.2 22901.8 243323.6live 6908.6 727.4 1211.8 1939.2 97208.4twtr 6461.8 147.4 1106.6 1254 202418.8qq 0 101.8 159 260.8 951066amzn 22757.2 120.2 939.2 1059.4 341746lnkd 3066.2 96.4 1432.8 1529.2 114203yndx 0 109 144 253 67407.2tbao 0 412.2 175.2 587.4 616755.8yhoo 2400.4 509 424.4 933.4 26142.6average 5444.02 250.48 2854.1 3104.58 271738.52Table 5.9: Standard deviations of external-js plugin-based time consumptionID Saved Time Plugin Time Flag Time Time Overheadggle 0.00 12.76 26.54 22.57wkpd 1808.80 6.77 2015.18 2004.42live 691.93 53.09 160.99 122.64twtr 1078.02 15.06 127.21 76.19qq 0.00 8.17 14.71 16.62amzn 1354.61 9.65 68.53 76.86lnkd 353.50 10.45 75.71 111.24yndx 0.00 8.46 6.44 24.64tbao 0.00 34.71 24.94 62.02yhoo 184.15 48.40 33.26 83.67475.4. EvaluationTable 5.10: Relative standard deviations percentages (absolute values ofcoefficient of variation) of external-js plugin-based time consumptionID Saved Time Plugin Time Flag Time Time Overheadggle Not defined 9.32 13.90 6.88wkpd 14.08 4.68 8.85 8.75live 10.01 7.29 13.28 6.32twtr 16.68 10.21 11.49 6.07qq Not defined 8.02 9.25 6.37amzn 5.95 8.03 7.29 7.25lnkd 11.52 10.84 5.28 7.27yndx Not defined 7.75 4.47 9.73tbao Not defined 8.42 14.23 10.55yhoo 7.67 9.50 7.83 8.96Table 5.11: Maximum values of external-js plugin-based time consumption(milliseconds)ID Saved Time Plugin Time Flag Time Time Overheadggle 0 151 211 356wkpd 14130 151 25032 25882live 8085 785 1457 2058twtr 8337 160 1329 1354qq 0 111 178 279amzn 24122 130 1023 1154lnkd 3372 114 1506 1682yndx 0 119 151 280tbao 0 467 192 628yhoo 2664 554 465 1036485.4. 
EvaluationTable 5.12: Minimum values of external-js plugin-based time consumption(milliseconds)ID Saved Time Plugin Time Flag Time Time Overheadggle 0 124 147 294wkpd 9765 134 20029 20840live 6286 654 1090 1764twtr 5751 130 1029 1181qq 0 91 141 239amzn 20484 110 854 985lnkd 2579 88 1332 1391yndx 0 97 135 225tbao 0 379 131 478yhoo 2184 440 386 813Table 5.13: External js-plugin-based bypassed comparisons: the detailed in-formation on events fired, comparisons that were categorized as unnecessary,false negatives and false positives.ID Events fired ByPassed States FN FPggle 9 0 10 0 0wkpd 170.6 158.4 10 1 3.2live 72.2 62.6 10 0 0.6twtr 69.4 59.8 10 0 0.6qq 9 0 10 0 0amzn 64.6 54 10 0 1.6lnkd 100.4 91 10 0 0.4yndx 9 0 10 0 0tbao 9 0 10 0 0yhoo 18.8 9.8 10 0 0495.4. Evaluationggle wkpd live twtr qq amzn lnkd yndx tbao yhoo averageInternal?JS?Plugin?based Time Improvements vs. Overheads Web ApplicationsTime (s)051015202530 ImprovementOverheadFigure 5.7: Internal js-plugin-based solution times: the time the secondplugin-based solution improves is greater than the time overheads it im-poses on the process. hence, the second plugin-based solution also overtakesthe default state transition management system. The bars represent theaverages taken over 5 rounds of experiments presented in table 5.14.Table 5.14: Internal js-plugin-based solution times (milliseconds): the sec-ond plugin-based solution overtakes the default process. The time it im-proves is greater than the time overhead it imposes on the crawling process.ID Saved Time Plugin Time Flag Time Time Overhead Total Timeggle 0 163.8 160.6 324.4 112488.6wkpd 12516.6 169.4 22748.8 22918.2 196764.2live 7024.2 261.4 1063.4 1324.8 96941.2twtr 4144.8 123 787.8 910.8 190713.8qq 0 147.6 133.4 281 233494.2amzn 12997.6 459.6 558 1017.6 131194.4lnkd 3046.6 127 1387 1514 115787yndx 0 102.6 179.2 281.8 65686.2tbao 546.2 245.8 190.4 436.2 550140.2yhoo 1941 290 296 586 650855average 4221.7 209.02 2750.46 2959.48 234406.48505.4. EvaluationTable 5.15: Standard deviations of internal-js plugin-based time consump-tionID Saved Time Plugin Time Flag Time Time Overheadggle 0.00 19.28 16.29 14.12wkpd 1914.28 30.95 1883.50 2698.28live 629.92 29.74 78.32 80.70twtr 192.48 9.67 92.94 68.77qq 0.00 18.15 6.99 25.42amzn 656.91 37.29 43.82 141.02lnkd 314.51 5.34 44.67 152.09yndx 0.00 12.34 9.81 17.43tbao 33.74 32.67 17.53 35.49yhoo 191.88 18.49 23.44 35.89Table 5.16: Relative standard deviations percentages (absolute values ofcoefficient of variation) of internal-js plugin-based time consumptionID Saved Time Plugin Time Flag Time Time Overheadggle Not defined 11.77 10.14 4.35wkpd 15.29 18.26 8.27 11.77live 8.96 11.37 7.36 6.09twtr 4.64 7.86 11.79 7.55qq Not defined 12.29 5.23 9.04amzn 5.05 8.11 7.85 13.85lnkd 10.32 4.20 3.22 10.04yndx Not defined 12.02 5.47 6.18tbao 6.17 13.29 9.20 8.13yhoo 9.88 6.37 7.91 6.12515.4. 
EvaluationTable 5.17: Maximum values of internal-js plugin-based time consumption(milliseconds)ID Saved Time Plugin Time Flag Time Time Overheadggle 0 195 176 347wkpd 15774 224 24571 27275live 7726 300 1148 1406twtr 4393 135 899 992qq 0 178 142 311amzn 13517 505 619 1243lnkd 3381 135 1456 1665yndx 0 123 193 309tbao 603 300 217 473yhoo 2195 316 324 623Table 5.18: Minimum values of internal-js plugin-based time consumption(milliseconds)ID Saved Time Plugin Time Flag Time Time Overheadggle 0 149 133 308wkpd 11139 150 20701 20397live 6321 237 967 1218twtr 3937 110 701 828qq 0 134 128 252amzn 12217 409 510 905lnkd 2711 120 1332 1349yndx 0 91 167 262tbao 518 218 171 396yhoo 1766 266 266 533525.4. EvaluationTable 5.19: Internal js-plugin-based bypassed comparisons: the detailed in-formation on events fired, comparisons that were categorized as unnecessary,false negatives and false positives.ID Events fired ByPassed States FN FPggle 9 0 10 0 0wkpd 170.2 158.4 10 1.2 2.8live 73.2 63.4 10 0 0.8twtr 49.6 40 10 0 0.6qq 9 0 10 0 0amzn 34.6 25.4 10 1.8 0.2lnkd 100.4 91.2 10 0 0.2yndx 9 0 10 0 0tbao 13.8 4.8 10 0 0yhoo 16.6 7 10 0 0.653Chapter 6Increasing the Number ofStates CrawledIn this chapter, we make the case for improving the scalability of crawlingmodern Web applications. As mentioned earlier, crawling modern Web ap-plications is a resource intensive process. In particular, managing storageand retrieval of states of Web applications has the potential to overwhelmthe memory of the crawler. The memory required for storing the states? datais likely to be of considerable size in modern Web applications. This mainlydepends on the average size of the states as well as the number of statesin the application under crawl. As the number of states in the state ma-chine increases, the memory allocated to the state machine grows larger andlarger. This continues up to a point where no further memory allocation ispossible because all memory available to the crawler is already allocated. Atthis point, the crawler breaks and the crawling process stops. To tackle thisproblem, we aimed at improving the memory utilization so that a ?greaternumber of states can be crawled? before all memory is allocated.In this chapter, first we briefly explain the background knowledge andfoundations on which the proposed enhancements are laid. As such, we touchupon the way memory management is achieved in Java because Crawljax,the crawler on which we examined our abstract ideas, is implemented in Java.This is done to pave the way for explaining the proposed improvements,the experiments, and the memory analysis. The memory management isfollowed by explaining how we chose a graph database for application inour solution. In fact, in the process of realizing our ideas into functionalmeasurable capabilities, we needed to select a graph database for creating analternative solution for improving the memory consumption in the crawlingprocess. Hence early in this chapter, we explain how the comparison is doneamong all viable options for selecting a graph database.In what follows, we continue with touching upon the foundation of mem-ory management to help present the suggested improvements, the memoryanalysis and the experiments.546.1. Memory Management6.1 Memory ManagementThe way memory is allocated and freed in the crawler does not solely dependon the way the tool is implemented. In fact, it is only the memory allocationthat is performed directly in the manner the design and implementation ofthe crawler dictates. 
However, the deallocation part of memory management is performed automatically by lower-level layers of the Java technology, which complicates memory measurements and optimizations in this project. Hence, we provide a brief introduction to the founding concepts of memory management in the Java Virtual Machine (JVM) on which the crawler is executed.

6.1.1 Java Virtual Machine Heap

The JVM heap is the part of JVM memory in which all class instances and arrays reside. All threads in a JVM share the same heap. The JVM heap is created when the JVM starts up. The heap can be of either a variable or an invariable size. In the former case, the heap can increase in size if the application requires more memory. Likewise, the heap size can also be reduced if the allocated heap is too large for the current consumption requirements of the application [20, 21].

There are also a number of JVM options that can be set (e.g. as command line arguments) to specify the initial, minimum, and maximum size of the JVM heap. Aside from being limited by the maximum size option (-Xmx), the size of the physical memory available to the underlying system on which the JVM runs is an upper limit for the heap size.

JVMs can also implement memory management processes to elevate the efficiency of memory utilization. These management processes are sometimes referred to as garbage collection (GC) processes. In order to make room for creating new objects, memory management processes remove the objects that no longer need to be retained in memory.

However, if an application consumes the heap memory to a degree that there is no more space left for creating further required objects, an out-of-memory error occurs in the JVM and the application execution stops:

"If a computation requires more heap than can be made available by the automatic storage management system, the Java Virtual Machine throws an OutOfMemoryError." (JVM Specification [21])

6.1.2 Garbage Collection

Garbage collection, or memory management in Java, is the process of removing the objects that will not be used in the future so as to prepare allocatable space for new objects in memory. Contrary to the C and C++ languages, which require the programmer to be in charge of object deallocation, Java takes care of clearing memory of unused objects. In what follows, memory management concepts such as the nursery, young generation, and old generation are described to shed light on how the JVM performs the memory management task.

As depicted in figure 6.1, there are two significant divisions in the JVM heap: the nursery (or young generation) and the old generation. The nursery is the part of the heap where new objects are allocated. When the nursery is full, the memory management process performs a garbage collection in the nursery to make room for new objects. At the same time, the objects that are still in use, and as a result are not collectible, are promoted to another part of the heap which is called the old generation. The garbage collection on the nursery is called a young or minor collection. When the old generation space is full, the garbage is collected in that space too. The GC on the old generation space is called an old or major collection.

Figure 6.1: Java heap divided into generations and the runtime options. Source: [2].

The concept of having a nursery in the heap is introduced to improve the performance of garbage collection. As "most objects are temporary and short-lived" [21], a two-generational heap allows implementing two types of GC to improve the efficiency.
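Before turning to the generational details below, a small standalone Java sketch (our own illustration, not part of Crawljax) shows how the heap ceiling set by the -Xmx option manifests at run time. Retaining a growing list of objects mimics what happens in the default crawler as states accumulate in the in-memory state machine:

    // Run with, for example: java -Xms256m -Xmx1g HeapWatcher
    // -Xms sets the initial heap size and -Xmx the maximum; once the live set of
    // objects no longer fits under -Xmx, the JVM throws an OutOfMemoryError.
    import java.util.ArrayList;
    import java.util.List;

    public class HeapWatcher {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            List<byte[]> retained = new ArrayList<>(); // stands in for states kept alive
            try {
                while (true) {
                    retained.add(new byte[1024 * 1024]); // allocate 1 MB and keep a reference
                    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                    long maxMb = rt.maxMemory() / (1024 * 1024); // reflects the -Xmx setting
                    System.out.println("used " + usedMb + " MB of " + maxMb + " MB");
                }
            } catch (OutOfMemoryError e) {
                // The -Xmx ceiling has been reached: this is the point at which the
                // default crawler would stop discovering new states.
                retained.clear();
                System.out.println("Heap exhausted; no further allocation was possible.");
            }
        }
    }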
As the most recently allocated objects are in the nursery, the GC required for the nursery can generally be much faster than the GC of the old generation space or the GC of a single-generational heap (a heap that treats old and new objects in the same manner in terms of collecting them).

Another improvement in the garbage collection process is that the JVM reserves a part of the nursery for the most recent objects. This area is called the keep area, and it is exempted from the first upcoming young collection. The reason behind this is that objects that have just been created in the keep area are very likely to survive the next young collection and hence be promoted to the old generation. The problem here is that once objects are promoted to the old generation space, they become immune to young collection and hence are kept in memory until the next major garbage collection. However, the keep area objects are either cleared or promoted by the second upcoming young collection.

Having covered the foundations of memory management in Java, which will help clarify the memory-related concepts throughout the chapter, we continue by discussing the qualities identified as requisite for alternative methods aiming at improving the memory consumption.

6.2 Criteria for Alternative Solutions

Optimizing the memory utilization is one of the goals of this work. As mentioned previously, there are multiple drivers that cause extensive memory consumption in modern Web application crawling. We, however, focus on the memory requirements arising from managing the storage of states in the crawling process. If the state space of a Web application is of considerable size and the states of the application are sizable on average, the crawling process runs out of memory at some point and stops crawling. In order to mitigate this limitation, we propose our ideas for improving the scalability of the crawling process.

Investigating the scalability problem of the crawling process, we put forward the idea of creating an alternative method for storing the state machine containing the states of the application. As the state machine is concurrently accessed and plays a significant role in the crawling process, our alternative solutions must deliver the following qualities:

- It must provide safe concurrent write and read capabilities for storing and retrieving states.
- It must not impose a considerable time requirement on the crawling process.
- It must have the flexibility for designing a suitable graph structure similar to the structure of the state machine, which is a graph data structure in the crawler.
- The memory overhead imposed by the execution of the solution and by the communication with it must not defeat the purpose of the solution. That is, it must use less memory than the default crawler.

Taking the above characteristics into consideration and searching for solutions to achieve our goal of improving memory consumption, we put forward the idea of utilizing a graph database.

Graph databases are optimized for use in applications whose data can naturally be represented in a graph-like form. In addition, some graph databases provide transaction management and multiple-access capabilities that are able to handle concurrent accesses. However, there are many graph databases, with various complexities and capabilities. We have to conduct a comparison to decide which graph database is most suitable for use in our solution.
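Before conducting that comparison, it is useful to make concrete what "storing the state machine" demands from any back end. The following minimal Java sketch summarizes the kind of operations an alternative state store must expose; the names are illustrative only (they are not Crawljax's actual interfaces), and the detailed functionality list extracted from the crawler follows in Section 6.3.1:

    // Illustrative sketch of the operations an alternative state store must support.
    // S stands for the state type and E for the transition (edge) type; in Crawljax
    // these roles are played by richer classes, so treat this as a summary, not an API.
    import java.util.List;
    import java.util.Set;

    public interface StateStore<S, E> {

        /** Adds the state only if no equivalent state is present; must be thread safe. */
        boolean addStateIfAbsent(S state);

        /** Adds a directed edge; both endpoints must already exist in the store. */
        boolean addEdge(S from, S to, E event);

        /** Number of states stored so far, maintained without locking the whole graph. */
        int stateCount();

        /** Outgoing and incoming edges of a given state. */
        Set<E> outgoingEdges(S state);
        Set<E> incomingEdges(S state);

        /** A shortest path of events from the index (start) state to a target state. */
        List<E> shortestPathFromIndex(S target);

        /** All states and all edges, e.g. for exporting the crawl model. */
        Set<S> allStates();
        Set<E> allEdges();
    }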
Furthermore, the time and memory overheads imposed by the selected database will be the focus of our research questions, and they will be extensively investigated in our experiments.

In order to perform this comparison, we need to set out the criteria for it, including what functionalities need to be supported and what qualities must be held by the database. In the following section we explain how we extract the criteria for selecting the database by reverse engineering the crawler.

6.3 Criteria for Selecting a Graph Database

Based on the investigation of Crawljax, we discovered the required functionalities that the alternative solutions must provide. A preliminary study of the concepts and characteristics of graph databases also helped us find a number of additional criteria for comparing current graph databases. We use these criteria to sift the available options and choose the graph database that fulfills our requirements best. The guidelines for choosing the database are explained in what follows. They include required functionalities, licensing, partial locks, the ability to have complex nodes in the graph, and productivity and maintenance issues such as offering an API for Java and Maven public repository availability.

6.3.1 Required Functionalities

Based on the investigation of the source code of Crawljax, we deduce that any alternative solution we provide for the Crawljax state-storage and retrieval mechanism has to provide the following functionalities:

- Concurrent reading from the stateflow graph.
- Concurrent writing to the stateflow graph.
- Ensuring uniqueness when adding states.
- Ensuring the presence of the start and end nodes when adding an edge in the stateflow graph.
- Managing the count of states already in the graph in a block-free and thread-safe manner.
- Feasibility of implementing a directed multigraph data structure.
- Finding the previous state of a given state in the state machine.
- Calculating the path to a given state from the index state.
- Getting all outgoing edges from a node.
- Finding all incoming edges to a node.
- Finding the outgoing nodes from a given node.
- Finding the target node of an edge.
- Checking if an edge exists between two nodes.
- Calculating the shortest paths from a given vertex to another vertex, ideally by Dijkstra's algorithm.
- Finding all nodes.
- Finding all edges.

We also identified a number of implicit requirements that we needed to fulfill. These requirements, extracted from the investigation of the crawler logic and code, are presented briefly as follows. First, we should be aware that whenever changes are made to the state of the application under crawl, the state must be compared to all previous states in the stateflow graph. Hence we need a method for performing these comparisons to ensure the uniqueness of states added to the stateflow graph. However, if the states are stored as primary keys in the database, or if the database does not allow duplicate vertices, the database itself might have an optimized way of performing the comparisons. On the other hand, we might not have to store the whole state object in the stateflow graph; instead, it might suffice to store the IDs of the states or their hashed representations. In that case, we have to provide a mechanism for linking these states' keys to the actual states' data.
6.3.2 Licensing

It is very important for us to ensure that the licensing characteristics of the graph database match the Crawljax open-source licensing. In particular, we consider whether it is proprietary software or whether it is available under some type of open-source license. As Crawljax itself is offered under the generous Apache version 2 license, we would like to avoid imposing any limitations on users of the crawler. To that aim, we avoid choosing more restrictive licenses and proprietary licenses.

6.3.3 Partial Locks

In the default version of the crawler, the methods that add states to the multigraph are synchronized over the entire stateflow graph. Thus, when a crawling node is examining the uniqueness of a new state, or when a crawling node is adding the new state to the graph, the whole stateflow graph is locked. What is ideal for us is to find a graph database that is able to provide a facility to update, add, or delete states and edges without locking the entire graph. Hence the ability to lock the database partially is of critical importance to us in the selection of a graph database.

6.3.4 Storing Objects in Nodes and Edges

Crawljax stores various pieces of information in the nodes and edges of the stateflow graph. As the states and the navigational paths between them comprise multiple objects, our database must be able to store these objects, or at least a transformation of the objects, in its nodes and edges.

6.3.5 API for Java and Maven Availability

As Crawljax is completely implemented in Java and utilizes Maven dependency management facilities, we prefer to use a database that provides a Java API and can be added as a dependency to Crawljax via public Maven repositories.

Having covered the criteria and guidelines for selecting a graph database for use in our solution, we go on with presenting the actual comparison and selection of the database in the next section.

6.4 Comparison of Graph Databases

We conducted a comparison among the available graph databases in order to select the most applicable graph database for our application. The comparison is based on the criteria listed in section 6.3. In addition to those criteria, we considered the design flexibility and the maintenance implications of our design as well. In particular, although some of the triple stores seemed to be promising in terms of the maturity of the software product, we decided not to proceed with triple stores. The reason was to avoid imposing unnecessary complexity on the crawler system. This complexity would be introduced by having to transform graph-like data into a set of complex relationships that could be shaped into triples. Hence, to eliminate the need for this transformation, we omitted triple stores from our final choices.

Table 6.1 shows the comparison of the databases that we considered for use in our solution. The abbreviations used in the table have the following meanings:

Funct.: The functionalities listed in the guidelines in section 6.3.1.
J & M: API for Java and Maven public repository availability.
SCONE: Storing complex objects in nodes and edges.
GDB: Graph database name.
PL: Partial locks.

As this comparison is performed in a more qualitative than quantitative manner, we use a three-level scale for each criterion. These levels are depicted as the background color of the cells of the table.
These levels are shown in table 6.2.

Table 6.1: Graph databases comparison: Neo4j emerged as the graph database most suited to our application.

GDB            Licensing    J & M      PL   Funct.     SCONE
Neo4j          GPLv3        Yes        Yes  Yes        Yes
FlockDB        APL2         Not Maven  Yes  No         No
GraphDB        APL2         Not Maven  Yes  Yes        HyperGraph
AllegroGraph   Proprietary  Not Maven  Yes  Partially  TripleStore
BigData        GPLv2        Yes        Yes  Yes        RDF
Filament       BSD          Poor       Yes  Poor       No
Graphbase      Proprietary  Not Maven  Yes  Yes        RDF
HyperGraphDB   LGPL         Yes        Yes  Complex    HyperGraph
InfiniteGraph  Proprietary  Yes        Yes  Yes        Yes
InfoGrid       AGPLv3       Yes        Yes  Partially  No
Titan          APL2         Immature   Yes  Yes        Yes

Table 6.2: Graph databases comparison scale levels

Level                Color
Acceptable           Green
Average              Yellow
Not very acceptable  Red

Based on the comparison presented in table 6.1, it emerged that Neo4j is the graph database most suited for use in our application. As shown in table 6.1, this graph database satisfies all our desired characteristics. In particular, its extensive locking options are very useful for the concurrent access requirements of the crawling process. It also provides optional multilevel caching that can optimize the time performance of the crawling. It is implemented in Java and is available in the Maven public repository. Furthermore, its open-source license fits well with the Crawljax license.

Having covered the foundations on which the scalable solution is based, the discussion continues by presenting our ideas for improving the number of states crawled in the next section.

6.5 Scalable Solution

Having selected Neo4j as our choice of graph database, we built an alternative solution for storage and retrieval of the state machine. Aiming at freeing the memory needed for storing the state machine, we utilized the graph database to delegate all the state storage functionalities to secondary memory. We utilized the graph database capabilities to provide all the previous functionalities of the in-memory stateflow graph. In addition, we wrapped the solution in the very same interface as that of the in-memory stateflow graph. By introducing this new component to the crawler, we aimed at increasing the scalability of crawling by exploring a greater number of states.

Handling concurrent accesses to the state-holder data structure is one of the important aspects of the crawling algorithm. Our alternative solution handles multiple concurrent accesses by utilizing the database's transaction management capabilities. The Neo4j graph database transactions support the ACID properties and ensure the safety of concurrent read and write operations.

As the API we provide for the proposed state storage solution is the same as the default one, the changes are encapsulated only within the scope of the storage component. However, one of the internal differences of the proposed solution is that it has to serialize the live objects into persisted byte arrays so as to be compatible with the data types the graph database accepts. Hence, we changed numerous classes with minor alterations to fulfill the requirement for serializability. Bearing in mind that introducing a database component to the system brings about time and memory trade-offs, we describe them in the following.

6.5.1 Memory Performance Trade-Offs

Delegating the management of the states and the transitions between them to a component that predominantly does not use main memory for storing data frees a considerable amount of space in the main memory.
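As an illustration of this delegation, the sketch below stores two states and the transition between them through Neo4j's embedded Java API (assuming a Neo4j 2.x-style API and plain Java serialization). The label, property keys, and relationship type are our own illustrative choices rather than the exact names used inside the crawler:

    // A sketch of persisting a crawl state and a transition outside the Java heap.
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    import org.neo4j.graphdb.DynamicLabel;
    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class StatePersistenceSketch {

        private final GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("target/crawl-db");

        /** Serializes a live object into the byte-array form Neo4j can store. */
        private static byte[] toBytes(Serializable object) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            return bytes.toByteArray();
        }

        /** Stores two states and the transition between them in one ACID transaction. */
        public void saveTransition(String fromId, Serializable fromState,
                                   String toId, Serializable toState) throws IOException {
            try (Transaction tx = db.beginTx()) {
                Node from = db.createNode(DynamicLabel.label("State"));
                from.setProperty("id", fromId);
                from.setProperty("payload", toBytes(fromState));

                Node to = db.createNode(DynamicLabel.label("State"));
                to.setProperty("id", toId);
                to.setProperty("payload", toBytes(toState));

                from.createRelationshipTo(to, DynamicRelationshipType.withName("TRANSITIONS_TO"));
                tx.success(); // commit; writers only contend on the nodes they touch
            }
        }
    }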
On the other hand, the utilization of a database introduces an additional memory overhead. The database (which is not an in-memory one) requires significantly frequent IO operations. These IO operations are memory consuming. In addition, we serialize the states and transitions before storing them in the database. Likewise, we deserialize them before bringing them back to life and using them in the crawling process. The serialization and deserialization processes carry considerable memory overheads. The transactions conducted by the database cause substantial memory overhead as well.

6.5.2 Time Performance Trade-Offs

In the current state of the crawler, the whole state machine is locked when a process wants to add a new state to the data structure. This puts other processes on hold and increases the time consumption of the crawling. In the scalable solution, however, partially locking transactions expedite the process of crawling by eliminating the need for locking the entire state machine. On the other hand, the scalable solution needs to communicate with the hard disk very frequently. This means the additional latency of the IO operations has the potential to impose a significant time overhead on the crawling. It is, however, the evaluation that answers how significant it can be. In the next section we discuss the evaluation of the scalable solution.

6.6 Evaluation

In the scalable crawling solution, we aimed at increasing the number of states crawled by improving the memory performance of the crawling. To this aim, we utilized a graph database to free the main memory from being allocated for state storage purposes. We aimed at answering the research questions by conducting experiments addressing the correctness, memory performance, and time performance of the solution.

We conducted a series of experiments to answer the research questions that shaped the objectives of this work. In this section we present the design of our experiments, the methodology of conducting them, and the results achieved from carrying out these experiments. We discuss the evaluation of the scalable crawling solution and the results of the related experiments in the following.

6.6.1 Research Questions

Considering the time and memory trade-offs of the scalable solution, we are interested in investigating how our solution performs in real crawling sessions. As such, we designed a number of research questions to shed light on the applicability of our solution. We defined three research questions to pinpoint the goals of the enhancements carried out in this work. These research questions address the correctness, scalability, and performance of the proposed solution compared to the default version of the crawler.

In particular, we address three different aspects of the proposed solution for improving the scalability of crawling. We investigated whether the solution performs the task of crawling in a correct manner. In addition, we measured how scalably the new solution works compared to the default version of the crawler. The time performance of the alternative solution was investigated too. In addition, a new research question emerged from interesting observations made during the experiments.

To delve into the goals of the work, here we break down the second research question into three more specific research questions to pinpoint the objectives of the research.

Correctness The first sub-question addresses the correctness of our graph database-based solution. This is indeed important because the state machine plays a very critical role in the crawling process, especially taking into account the concurrent aspects of the crawler.
RQ2.a: Does (and to what extent) the proposed solution perform the task of crawling a Web application correctly?

This research question inherently dictates a type of experiment that is very similar to what is generally carried out in the process of testing software artifacts. Thus, we need to define what correctness means in this context. In fact, similar to software testing, we need an oracle to decide whether the task of crawling is performed correctly or not.

Scalability: Memory Performance The second sub-question addresses the scalability of the new mechanism for managing state storage and retrieval.

RQ2.b: To what extent can the proposed crawler crawl Web applications in a more scalable manner than the default version of the crawler does?

In the context of this work, we define the scalability of a crawler as the ability to crawl and store more states within a given Web application utilizing a certain amount of main memory. Hence, this RQ aims to determine whether the proposed version is able to overtake the default version in crawling a larger number of states within a Web application while the maximum Java heap space is limited to a certain amount of memory.

Time Performance In the third sub-question we target the time consumption of the proposed solution compared to the default crawler.

RQ2.c: How does the graph database-backed solution perform in crawling Web applications in terms of time consumption?

This research question examines in particular the time the proposed version requires to crawl a Web application compared to the default version of the crawler. We are interested to see which version crawls a specific number of states in a Web application in a shorter amount of time, given a limited amount of memory.

Having stated and explained the research questions highlighting the objectives of the work, we move on to the discussion of how we pursued the answers to these questions by designing the experiments.

6.6.2 Experimental Design and Methodology

Correctness Experiments As the very first step, correctness must be defined in the context of this work. In order to achieve this goal, we need to find a baseline for assessing our new solution. This baseline is naturally the original crawler. Hence, we assume that the default version of the crawler crawls a Website correctly. As such, correctness, in this context, means performing the task of crawling in the same manner as the default version of the crawler does. This, in turn, suggests that correctness means producing the same output as the original version. So the default version can be utilized to obtain an oracle for assessing the proposed version. Moreover, this oracle can be used to determine to what extent the new version performs in equivalence to the default version. As mentioned in the introduction, the crawler builds a model out of the crawling of a Web application in the form of a stateflow graph. We utilize this stateflow graph model to compare whether our proposed crawler produces the same output as the default version.

There are two important issues that need to be addressed before we use the stateflow graph for shaping our oracle. First, we should be aware that we are implicitly assuming the default version of the crawler is deterministic. Deterministic in this context means that if we crawl the same Web application repeatedly, the same stateflow graphs are built as the result of the crawling sessions.
In addition, we need to have a Web application whose state transitions remain the same over the course of multiple crawling sessions. In other words, the Web application must not behave randomly, must not have time-based behavior, or any other type of behavior that leads to different stateflow graphs over multiple crawling sessions.

The first step in this experiment is to either find or build a deterministic Web application to be used as an object of the correctness experiments. Building one such Web application seemed to be more feasible, if not the only option, because we could not yet use the crawler for finding deterministic Websites. The reason is that we did not yet know whether the crawler is deterministic. As such, we built a Web application that has multiple clickables. Firing click events on these clickables causes the state of the application (the DOM tree) to transition to numerous states. The key point here is that the application is designed in a way that clicking on a specific sequence of elements will always result in the same sequence of state transitions. This Web application is assured to work deterministically, having been tested thoroughly and manually. Having a simple and manageable design, the application is completely deterministic.

Having confidence in the determinism of the object application, we need to investigate whether the crawler has deterministic behavior too. Determinism for the crawler means that, given the Web application is deterministic, the partial stateflow graph created by the crawler is identical over multiple crawl sessions. This means that if we set a limit on the total number of states crawled and crawl the Web application twice, the resulting stateflow graphs, SFG1 and SFG2, must have the following properties:

- The number of states found in both stateflow graphs, SFG1 and SFG2, must be equal.
- All the states present in SFG1 must be present in SFG2 too, and vice versa.
- Given there is a transition T1 between states S1 and S2 in SFG1, there must be a transition T2 in SFG2 which joins the counterparts of S1 and S2.

By partial stateflow graph we mean that the states of the Web application are not exhaustively discovered to the point that no further state is detectable in the Web application.

In order to assess if these qualities hold in the crawler, we conduct an experiment to answer the following research question:

RQ2.0: Is the default version of the crawler deterministic in crawling Web applications?

In order to answer this question, we crawl the object application we developed ten times and compare the stateflow graphs to see if they hold the above qualities. The comparison results show that the stateflow graphs satisfy the above criteria. This means we can rely on the deterministic nature of the crawler to form our correctness oracle.

Having discussed the determinism of the crawler, we proceed with investigating whether the graph database-backed solution works correctly. To assess the correctness of the proposed solution, we crawl the deterministic Web application with both the default crawler and the crawler enhanced by our solution. Afterward, we compare the two stateflow graphs based on the properties listed above.

In order to further investigate the correctness of the proposed solution, we do not limit the experiments to the deterministic Web application developed by ourselves. Having ensured the default crawler is deterministic, we use it to find more deterministic Web applications.
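The properties above boil down to checking that the two graphs contain exactly the same states and that every transition in one graph has a counterpart in the other. A simplified Java sketch of such a check, with states reduced to identifier strings (for instance, hashes of their DOM trees), is shown below; it illustrates the oracle rather than reproducing the exact comparison code used in the experiments:

    import java.util.Map;
    import java.util.Set;

    public final class StateflowGraphComparator {

        /**
         * Checks the equivalence properties over two partial stateflow graphs, each
         * reduced to a set of state identifiers and a map from a state to the
         * identifiers of the states it has transitions to.
         */
        public static boolean equivalent(Set<String> states1, Map<String, Set<String>> edges1,
                                         Set<String> states2, Map<String, Set<String>> edges2) {
            // Property 1: equal number of states.
            if (states1.size() != states2.size()) {
                return false;
            }
            // Property 2: every state of SFG1 is present in SFG2 and vice versa.
            if (!states1.equals(states2)) {
                return false;
            }
            // Property 3: each transition S1 -> S2 in SFG1 has a counterpart in SFG2
            // (checked in both directions for symmetry).
            return containsAllTransitions(edges2, edges1) && containsAllTransitions(edges1, edges2);
        }

        private static boolean containsAllTransitions(Map<String, Set<String>> superGraph,
                                                      Map<String, Set<String>> subGraph) {
            for (Map.Entry<String, Set<String>> entry : subGraph.entrySet()) {
                Set<String> targets = superGraph.get(entry.getKey());
                if (targets == null || !targets.containsAll(entry.getValue())) {
                    return false;
                }
            }
            return true;
        }
    }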
The solution we came up with for finding additional deterministic Web applications is to crawl a Web application multiple times with the default crawler and compare the stateflow graphs. If the stateflow graphs are equal, we deduce that the application is deterministic and can be used in the correctness experiments.

We started running the determinism test on the top Alexa Websites to check whether we could find objects for our experiments. However, these most visited Web applications are highly complex, and we found that none of the top ten were deterministic. We also tested other simple applications that we expected might be deterministic, but still none of them were. What we did next was to start crawling random Websites, hoping that we could find some deterministic Web applications. Fortunately, we found a number of Web applications that were deterministic, so we used them as objects of the correctness experiments. The applications found to be deterministic are listed in table 6.3, and the Websites that were crawled but were not deterministic are listed in appendix A.

Table 6.3: Correctness experiments objects

States Crawled  Web Application
50              www.al-awda.org
50              www.martinkrenn.net
50              www.thething.it
50              www.c-level.org
50              www.rprogress.org/index.htm
50              www.math.mcgill.ca det
1               www.isc.org/downloads/BIND
1               www.hunyyoung.com
2               home.planet.nl/mooij321
1               www.antique-hangups.com
3               www.project451.com
1               lmuwnmd.wpengine.com/...

Scalability Experiments As we finished the correctness experiments, having been assured that the new solution crawls Web applications in the same manner as the default crawler and produces the same output, we proceeded with the memory performance experiments. However, the first results showed that our solution was not able to outperform the default version in terms of the number of states it crawled. We concluded that this was either a sign of inefficient memory consumption in our solution or an indicator of too large a memory overhead imposed by the solution.

As a result, we conducted a comprehensive memory analysis on the default crawler and the new solution to understand how exactly the memory is consumed by different components of the crawler and to find inefficiencies in them. This memory analysis gave us insight into the internal behavior of the crawler and helped resolve the inefficiencies. The memory analysis is presented after the experiment results in section 6.7.

Conducting the scalability experiments, we investigated whether the new version could improve the number of states crawled compared to the default version of the crawler, as stated in research question RQ2.b. To this aim, we crawled a number of top Alexa Web sites with the proposed crawler and the default crawler to compare memory utilization in the two versions of the crawler. The important variables and facts about the experiments are as follows:

- Top 10 Alexa Websites are crawled as the objects of our scalability experiments.
- We limit the memory by setting the maximum Java heap size to one Gigabyte.
- The most important criterion for comparison is the number of states crawled given the limitation on the memory usage.
- For each crawling experiment, the default version and the enhanced crawler are run with the same configurations.
- The maximum number of states is set to be unlimited, so the crawler crawls until it uses up all the memory, given that the Web application has enough states.
- DOM size for each state is measured and recorded.

To evaluate the memory performance of the proposed solution we crawled the ten most popular Websites from the Alexa top sites. With each version of the crawler, we crawled each Web application five times and then took the average of the results over the five crawling sessions. In each experiment we collected the average DOM size of the states in characters. Most importantly, the number of states crawled up to the very point before throwing the out-of-memory exception was recorded in the experiments. The objects of the scalability experiment are presented in table 6.4.

Table 6.4: Scalability experiments objects

ID    Web Application
ggle  www.google.com
wkpd  www.wikipedia.com
live  www.live.com
twtr  www.twitter.com
qq    www.qq.com
amzn  www.amazon.com
lnkd  www.linkedin.com
baid  www.baidu.com
facb  www.facebook.com
yaho  www.yahoo.com

As mentioned in the background, there is always a limitation on the amount of memory available to the Java heap, be it a start-up option or the maximum capacity of the system on which it is running. We, however, set this limitation to one Gigabyte. We selected 1 Gigabyte as the upper limit because, in a system used by developers for programming, such as the system we have been using, 1 Gigabyte is almost the maximum amount that can be allocated to the Java heap while still being able to run other tasks, e.g. the experiments in our case, smoothly without having the system freeze. The experiments end when the crawler allocates all the Java heap memory and throws an out-of-memory error because there is no more memory left to allocate to the requests of the crawling process.

We had various options for measuring memory consumption in the experiments. We utilized the VisualVM and YourKit Java profilers for monitoring the memory consumption of the two versions of the crawler. In section 6.6.3 we discuss the results of the experiments.

Time Performance Experiments As mentioned in the explanation of RQ2.c, time performance in this context is assessed holistically and is based on the time required to crawl Web applications by the two versions of the crawler. In particular, we aim at investigating which version requires less time to crawl a certain number of states in the object Web applications. There are a number of important differences between these two versions of the crawler that are decisive for the results of the experiments. In particular, we are interested in investigating the counteracting effects of the following characteristics.

First, the scalable version utilizes a database that resides in secondary memory (the hard disk). This means the database needs to perform a considerable number of I/O requests and, as a result, might consume significant time while storing and retrieving data. On the other hand, and more interestingly, the database does not have to lock the stateflow graph to add or update data, in contrast to the default version, which locks the stateflow graph; as a consequence, threads cannot update the model simultaneously in the default version.

The objects of these experiments are the same as those of the scalability experiments listed in table 6.4. We crawl each of these Web applications five times with both crawlers and measure the time. The maximum number of states is set to 100 and the Java heap is set to a maximum of one Gigabyte. As mentioned in the scalability experiment, there is always a maximum for the Java heap. If we had not set the maximum manually, it would have been the default maximum.
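For reference, the sketch below shows how one such time-performance run could be configured. We assume a Crawljax 3.x-style builder API here; the exact method names may differ between releases, so treat this as an outline of the configuration rather than verbatim code. The one-Gigabyte heap cap is not part of this configuration; it is passed to the JVM separately (e.g. -Xmx1g):

    import java.util.concurrent.TimeUnit;

    import com.crawljax.core.CrawljaxRunner;
    import com.crawljax.core.configuration.CrawljaxConfiguration;
    import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;

    public class TimeExperimentLauncher {
        public static void main(String[] args) throws Exception {
            CrawljaxConfigurationBuilder builder =
                    CrawljaxConfiguration.builderFor("http://www.google.com");
            builder.setMaximumStates(100);                 // cap used in the time experiments
            builder.setMaximumDepth(10);
            builder.setMaximumRunTime(10, TimeUnit.HOURS); // generous wall-clock safety limit
            builder.crawlRules().clickDefaultElements();   // same click rules for both versions

            long start = System.currentTimeMillis();
            new CrawljaxRunner(builder.build()).call();    // blocks until crawling finishes
            System.out.println("Crawl took " + (System.currentTimeMillis() - start) + " ms");
        }
    }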
Hence, given the fact that all objects were crawlable up to 100 states while utilizing 1 Gigabyte of memory, we opted to set this upper limit, which also fits well with the scalability experiments.

Having covered the design and methodology of the experiments, we present the results of the experiments in the following section.

6.6.3 Results

The results of the experiments are presented here to answer the three research questions that were explained in the previous sections. These research questions targeted the correctness, scalability, and time performance of the proposed solution.

RQ2.A: Correctness We conducted the correctness experiments on the Web applications listed in table 6.3. In addition, we ran the experiment on the deterministic Web application that we developed as well.

The results show that the new solution works correctly in crawling all of the Web applications crawled. Hence, the answer to RQ2.a is positive.

The scalable crawler produces exactly the same output as the default crawler. In fact, the first correctness experiment that we ran was not positive, but that led us to find and fix the bugs, after which the results were positive for all Web applications.

RQ2.B: Scalability The gist of the results of the scalability experiments is presented in the bar charts of figures 6.2 and 6.3. As presented in figure 6.2, the results show that in all experiments the proposed solution performs more efficiently and demonstrates more scalability than the default version. In addition, as shown in figure 6.3, the proposed crawler achieves improvements in all experiments with an overall average of 88.16%. Hence, the answer to question RQ2.b is:

The proposed solution improves the scalability of the crawling process by 88.16% on average.

Table 6.5 shows how the new solution outperforms the default version when tested on the different Web application objects. This table shows the details of the results.

Figure 6.2: Scalable vs. default no. of states crawled: in all cases the proposed version overtakes the default crawler in terms of the number of states crawled given a limited amount of heap memory. (Bar chart over the Web applications ggle, wkpd, live, twtr, qq, amzn, lnkd, baid, facb, yaho, and their average; y-axis: No. of States, 0-3000; series: Default and Scalable.)

Figure 6.3: Scalability improvements: the percentage of increase in the number of states crawled. (Bar chart per Web application; y-axis: Improvement Percentage, 0-350.)

RQ2.C: Time Performance As presented in table 6.6, the default crawler performs more efficiently than the proposed solution. This is, in fact, not surprising. As discussed in 6.5.2, the time performance trade-offs anticipate the fact that introducing a slower component, such as secondary memory, to the crawling process can slow down the system. Moreover, compared to the scalability achieved by the solution, the time overhead is not intolerable. Hence, the answer to the question is: although
the proposed solution outperforms the default crawler in two cases, and in six out of ten cases the time overhead is less than 10%, on average the time performance of the default crawling technique looks more promising.

The proposed solution imposes a time overhead of 18.20% on average; if Facebook.com is excluded from the results, the rest of the experiments show a time overhead of 6.13%.

6.7 Further Memory Analysis

As mentioned earlier, we performed a comprehensive memory analysis on both the default crawler and the enhanced version. We performed this analysis to gain more insight into the way memory is utilized in the heap, so as to crawl a greater number of states by means of improving the memory performance.

The initial goal of this memory analysis was to discover the inefficiencies of the proposed version so as to achieve more scalability. This led to finding two memory leak points and patching them. However, after finding the memory leak points and achieving the scalability, we continued the analysis in more depth so that it can be utilized in future memory optimizations.

Table 6.5: Scalability experiments results: the proposed version outperforms the default crawler in terms of the number of states crawled.

ID       Default States  Scalable States  Improvement (%)
ggle     257.4           1057.6           310.88
wkpd     347.6           919.6            164.56
live     1407.4          2593.4           84.27
twtr     114             133.2            16.84
qq       130.2           176.4            35.48
amzn     284.8           412.8            44.94
lnkd     1357.8          2506.8           84.62
baid     247             303.4            22.83
facb     1362.4          2096.4           53.88
yaho     331.4           541              63.25
average  584             1074.06          88.16

We have considered two main approaches to compare the efficiency of memory consumption of the two versions of the crawling algorithm. First, we can limit the amount of memory available to both crawlers and measure the number of states crawled before all memory is consumed. The other option is to limit the number of states and measure the amount of memory consumed in the crawling. Both of these methods have some advantages and disadvantages.

As the aim of this work was to increase the number of states crawled given a limited amount of memory (e.g. one Gigabyte), measuring the number of states is a more intuitive approach for carrying out the experiments. However, measuring the memory gives us more insight into what is taking place internally and helps us understand whether the memory is used efficiently. Moreover, measuring the memory consumption can provide pointers for finding inefficient parts of the crawler and fixing them. Hence we use both methods to achieve a deeper insight into the memory consumption characteristics of the crawlers.

In order to compare the scalability of the two versions of the crawler, we started off by carrying out an experiment with the specifications listed in table 6.7.

What we measured here was the number of states crawled given the limited amount of memory. Neither of the crawlers exceeded the time limit. In fact, the time limit was deliberately set to 10 hours as, by experience,
we knew that this amount of time is enough for both crawlers to exceed the memory limit. The results, however, to our surprise, showed that the proposed solution did not outperform the default crawler in terms of the number of states crawled. The default version crawled 533 states while the proposed solution crawled only 378 states.

Figure 6.4: Scalable vs. default time performance: the proposed version overtakes the default crawler only in two cases out of ten, and in six cases the time overhead is less than 10%. However, on average the time overhead is 18.20%. (Bar chart over the Web applications ggle, wkpd, live, twtr, qq, amzn, lnkd, baid, facb, yaho, and their average; y-axis: Time (s), 0-1500; series: Default and Scalable.)

Using the second approach, we carried out another experiment by profiling the crawlers using the VisualVM profiling tool. We set up a second experiment with the specifications outlined in table 6.8.

The result of the experiment, however, shows that the default version consumes less memory than the enhanced version, which was surprising to us. The memory consumption of the default version, shown in figure 6.5, reaches a maximum of 78 Megabytes, while the enhanced crawler exceeds a memory consumption of 197 Megabytes, as shown in figure 6.6. As a result, we repeated the same experiment but changed the object of the experiment, crawling www.yahoo.com instead of the previous URL. However, the results were still not promising. We observe, as shown in figures 6.7 and 6.8, a spike at the early stages of crawling for both crawlers. Afterward, neither of the crawlers shows a significant improvement over the other version.

Table 6.6: Time performance experiments results: the default crawler performs more efficiently in most of the cases. However, in two cases the proposed solution overtakes the default crawler, and in six out of ten cases the time overhead is less than 10%.

ID              Default Time (s)  Scalable Time (s)  Overhead (%)
ggle            219.2             231.8              5.75
wkpd            439.4             557                26.76
live            510.2             512.4              0.43
twtr            1237.8            1535.8             24.07
qq              837.2             952.4              13.76
amzn            310.2             271.4              -12.51
lnkd            196               170.6              -12.96
baid            969.4             1007.6             3.94
facb            137.8             312.6              126.85
yaho            281.6             298.2              5.89
Average         513.88            584.98             18.20
Avg excl. facb  555.67            615.24             6.13

By observing the memory consumption graphs of VisualVM and the log output of the crawlers, we came to the conclusion that we needed to revise the experiments. First, observing considerable spikes at the beginning of the crawling shows that 100 states is not a justifiable limit for the number of states, because the memory required for the storage of the states is not considerable enough to make the maximum occur at the end of the crawling rather than at the beginning. In addition, the graphs produced by VisualVM do not distinguish between the old generation and the young generation. Hence we considered utilizing a more powerful tool providing us with more information about the memory utilization.

In fact, automatic garbage collection in Java, which cleans unused memory in a relatively complicated way, makes memory measurement experiments more cumbersome. The main reason is that the frequency of garbage collection does not depend solely on having unused memory to be cleaned, i.e. the frequency is not constant. The frequency is also dependent on the amount of memory available to be allocated. The less memory available in the heap, the more frequently garbage collections occur.

In addition, there are two types of garbage collection, major and minor garbage collections, each of which affects the used memory measurements differently.
Table 6.7: Initial memory experiment specifications

Aspect                    Specification
URL                       www.yahoo.com
Maximum Number of States  Unlimited
Maximum Depth             10
Maximum Time              10 Hours
Maximum Memory            1.5 Gigabytes

Table 6.8: Second memory experiment specifications

Aspect                    Specification
URL                       www.google.com
Maximum Number of States  100
Maximum Depth             10
Maximum Time              10 Hours
Maximum Memory            1.5 Gigabytes
Profiling Tool            VisualVM

There is almost always a difference between the amount of memory allocated and the amount of memory that we expect to be allocated based on our assumptions about the algorithm of a Java program. Hence, measuring memory at specific points in time, without confidence about whether garbage collection has been performed or not, is not the best available indicator of memory utilization.

In the default crawler, as the states' data is retained in memory until the end of the crawling, we expect this data not to be garbage collected, neither by major collection nor by minor collection. On the other hand, when running the proposed crawler, we expect the states' data to be garbage collected once in a while, as all references to this data are gone after saving the state data in the database. As such, we need the young generation and old generation measurements to understand whether the objects related to the states in memory are garbage collected or not.

Hence, instead of comparing memory at specific points in time, we need to have old and young generation data and observe their trends. What we expect to observe, by comparing the trends of the old generation in the two crawlers, is that the states' information is freed in the enhanced crawler. So we profile both crawlers with the YourKit Java profiler to achieve a better understanding of the memory consumption in both crawlers.

Figure 6.5: Memory consumption of the default version: the experiment is carried out on google.com and the memory consumption is demonstrated.

First of all, we found two leak points in the proposed crawler. In fact, what held us back from achieving more scalability was that as the states were being added to the stateflow graph one by one, they were also being referenced in another sub-component of the crawler that was responsible for managing the paths of the crawling. By fixing the leak points in those parts, we achieved the scalable results presented in the scalability experiment results.

Utilizing the YourKit profiler, we took snapshots of the memory usage of the default crawler and discovered the information summarized as follows:

- The state machine method in which the new states are created (newStateFor) accounts for 52 percent of the memory allocated to objects.
- Inspecting the new DOM accounts for 45 percent of the memory allocated to objects.
- 99 percent of the objects created by newStateFor are state objects.
- Objects allocated by the inspectNewDOM method are mostly DOM-related objects such as HTML element, HTML anchor element, HTML script element, etc.

Figure 6.6: Memory consumption of the proposed version: the experiment is carried out on google.com and the memory consumption is demonstrated.

We conducted the same analysis on the snapshot taken from the scalable version, and the information is summarized in the following:

- The state machine method in which the new states are created (newStateFor) accounts for 52 percent of the memory allocated to objects.
- Inspecting the new DOM accounts for 64 percent of the memory allocated to objects.
- Similar to the default version, objects allocated by the inspectNewDOM method are mostly DOM-related objects such as HTML element, HTML anchor element, HTML script element, etc.
- The method follow(), which brings the browser from the index to the state that must be explored, allocates 24 percent of the memory to objects mostly used for IO.
- State objects retain only 2 percent of the memory.
- The candidate actions account for 69 percent of the memory.

We performed seven different memory inefficiency inspections on both crawlers to diagnose potential memory inefficiencies.

Figure 6.7: Memory consumption of the default version: the experiment is carried out on yahoo.com and the memory consumption is demonstrated.

The plots of memory utilization by the two versions are provided in figures 6.9 and 6.10.

As expected, the old generation for the default crawler is ascending because states are not freed and are retained in memory. However, for the proposed solution, the freeing of state objects makes the graph descend at times.

As presented in the tables and figures of this chapter, the proposed solution works correctly in crawling all object Web applications. In addition, the proposed solution improves the scalability of the crawling process by 88.16%. This improvement is achieved at the cost of a time overhead. The proposed crawler imposes a time overhead of 18.20% on average; if Facebook.com is not considered, the rest of the experiments suggest a time overhead of 6.13%.

Having illustrated our improvement ideas on the scalability of the crawler and presented the results of the experiments, we revisit the questions and the answers obtained for them in this work in the next chapter.

Figure 6.8: Memory consumption of the proposed version: the experiment is carried out on yahoo.com and the memory consumption is demonstrated.

Figure 6.9: Default memory consumption: the lowest curve (in red) shows the old generation allocation. On top of that, the young generation (in salmon) is presented. The highest curve (in blue) shows the maximum amount of memory available to be allocated, i.e. the Java heap size. The important observation here is that the old generation allocation never drops, as all the states are retained in memory. In this experiment the Java maximum heap size is set to 1 Gigabyte and 330 states are crawled. (Plot: Java heap memory consumption of the default crawler; x-axis: Time (seconds); y-axis: Memory (Megabytes); series: Old Generation, Young Generation, Heap Size.)

Figure 6.10: Improved memory consumption: the lowest curve (in red) shows the old generation allocation. On top of that, the young generation (in salmon) is presented. The highest curve (in blue) shows the maximum Java heap memory available to be allocated. One of the results of our work is that the old generation allocation occasionally drops, i.e. the states' data is freed. This enables the crawler to discover 542 states using 1 Gigabyte of memory. (Plot: Java heap memory consumption of the scalable crawler; x-axis: Time (seconds); y-axis: Memory (Megabytes); series: Old Generation, Young Generation, Heap Size.)

Chapter 7
Discussion

Having presented the proposed enhancements and their evaluations in the last two chapters, in this chapter we revisit the research questions and discuss the findings in more detail.
We review the findings of the state transition management work first, and then continue with discussing the answers to the research questions of the scalability work. Furthermore, we discuss the threats to validity of the work as well. Finally, the ideas that emerged from this thesis for the direction of future work conclude this chapter.

7.1 State Transition Management

Time performance improvement was the target of the enhancements we put forward for the state transition management of the crawling process. In order to achieve the improvement, we brought forth and developed three alternative methods for managing the transition of states in the crawling process. The results of the evaluation of these methods, revised incrementally one after another, are discussed in this section.

7.1.1 RQ1.A: Time Performance of the Proxy-Based Solution

The goal of the experiments on the proxy-based solution was to examine to what extent this method is able to improve the time efficiency of state transition management. As shown in figure 5.5, the time overhead the proposed enhancement imposes on the crawling process is significantly greater than the time improvement it brings about. In eight cases out of ten, the default crawling process performs better than the proposed proxy-based solution. Moreover, in seven of those eight cases, the default crawler consumes significantly less time than the proposed solution. The proposed solution overtakes the default crawling process only in two cases, and the performance difference in these two cases is not as considerable as the difference in the cases where it is overtaken by the default one.

All in all, as shown in the average bar, the default version performs better than the proposed one. Having performed this experiment, we now have a better insight into the effects of utilizing a proxy for time optimization in the crawling process. In fact, the findings of this part of the work helped guide the rest of the study to a great extent. For example, we had planned to form two other projects for which proxy utilization would have been a pivotal issue.

7.1.2 RQ1.B: Time Performance of the First Plugin-Based Solution

Similar to RQ1.a, the goal here was to assess the time performance, but for the second solution we proposed. Having revised this solution based on the lessons learned from the development and evaluation of the previous solution, we assessed the time performance of the solution with the same set of experiments as that of the proxy-based solution.

As shown in the bar chart of figure 5.6, the first plugin-based solution improves the time performance of the state transition management to a great extent. The proposed solution overtakes the default version in 6 cases, and in the rest of the cases the difference is not very significant. The proposed solution, on average, improves the time consumption of the transition management process by 253.34%.

7.1.3 RQ1.C: Time Performance of the Second Plugin-Based Solution

The time performance of this version of the crawler was assessed in the very same way as its two predecessors. As illustrated in figure 5.7, the third solution improves the time performance of the state transition management of the process. Although this version of the crawling process was theoretically more refined than the second solution, the results show that this version, on average, improves the time performance of the transition management process by 197.48%, which is less than the second solution.
However, this version also significantly overtakes the default version.

7.2 Scalability

In the scalability part of this work, we developed the idea of creating an alternative solution for the storage of Web application states' data. The alternative solution employs a graph database for storing the data in order to let the main memory be available to the rest of the tasks in the crawling process. We devised three research questions targeting the correctness, memory performance, and time performance of the crawling process. In the process of carrying out the work, further questions emerged, which we seized the opportunity to examine.

7.2.1 RQ2.0: Determinism of the Crawler

In the process of designing the experiments for pursuing answers to RQ2.a, which targeted the correctness of the proposed solution, we faced the challenge of developing an oracle for examining the correctness of the solution. As we opted to utilize the default solution in shaping the oracle, we had to figure out whether the crawler behaves deterministically. Hence, we formed RQ2.0 and, carrying out the experiments, the results showed that the default crawler is completely deterministic.

7.2.2 RQ2.A: Correctness

Having assured the determinism of the default crawler, we built and harvested a number of object Web applications and assessed the correctness of the proposed solution. The oracle used for testing the correctness of the solution was formed over the stateflow graph that the crawler builds out of crawling a Web application. The results of the experiments show that the proposed solution crawled all the object Web applications completely correctly. It discovered the same stateflow graphs as the default crawler.

7.2.3 RQ2.B: Memory Performance

Having obtained full confidence in the correctness of the proposed solution from the experiment results, we continued with the memory performance part of the work. RQ2.b investigates to what extent the proposed crawler increases the number of states crawled. The answer to this question is illustrated in figures 6.2 and 6.3. As shown in the figures, the proposed solution is able to outperform the default crawler in all instances and achieves a greater state coverage. The proposed solution, on average, improves the scalability by 88.16%.

The scalability improvements achieved here can potentially be useful if the crawler is deployed as a cloud computing service (e.g. on Amazon Web Services). Especially by supporting partial locks on the stateflow graph, the proposed crawler can help implement an efficient concurrent deployment of the crawler. On the contrary, the default version of the crawler needs to lock the whole stateflow graph when it needs to add states or transitions to the data structure.

7.2.4 RQ2.C: Time Performance

In order to improve the scalability of the crawling process, we opted for creating an alternative solution utilizing a graph database. As mentioned in the discussion of the trade-offs, we investigated the effects of the discussed counteracting factors.
As the results of the experiments show, the default version of the crawler outperforms the proposed solution in most of the experiments. However, in two out of ten instances the proposed solution overtakes the default version, and in five out of ten cases the time overhead imposed by the proposed solution is less than 10%.

Having provided an overview of the research questions and the answers obtained from the experiments, we move on to discussing the threats to validity.

7.3 Threats to Validity

This section sheds more light on some of the concerns we had during this work and provides details on how we endeavored to mitigate them. We address both the internal and the external threats to validity of this work.

7.3.1 Internal Validity

In the first part of this work, the state transition management improvements, we encountered a substantial obstacle. The top Alexa websites are highly complex Web applications, and they are not deterministic by any means. To mitigate the determinism problem in the experiments, we decided to run a greater number of experiments on each object and then average the results. However, after carrying out the experiments, we observed that the state transition management part of the crawling process accounts for only a tiny portion of the total crawling time. As a result, we strove to find a method that could compare the two versions of the crawler under the same circumstances. Hence, we put forward the idea of having both versions of the state transition management plugged into the crawler at the same time. This way, when checking whether the application has transitioned to a new state, we first use the mutation-summary library to inquire about changes made to the DOM. Afterward, regardless of the report provided by our agent, we perform the default comparison behavior. If the mutation summary reports no change and the comparison confirms it, the time spent on the default comparison is regarded as the time that would have been saved had only the proposed solution been plugged in. We measure the time each of the solutions consumes on performing its tasks and then compare them based on these times.

7.3.2 External Validity

Regarding the external validity of this work, the most important concern is how well the conclusions can be generalized. Firstly, the Web applications selected as the objects of the experiments can strongly affect the final results. The set of all Web applications on the Internet is too large to find a reasonably representative subset. Hence, to mitigate the systematic error, i.e., the bias in selecting object Web applications, we opted to choose the most visited websites from the Alexa top list.

In addition, the configurations of the crawler for running the experiments provide a wide set of options; for example, the types of HTML elements chosen to be clicked on during the crawl can be pivotal to the final results of the experiments. To mitigate this bias, we took care to select configurations that are most natural, general, and meaningful for crawling a Web application for testing, comprehension, or analysis purposes.

Having discussed the threats to validity of this work, we move on to outlining the directions of this research for future work.

7.4 Future Work

Throughout the multiple projects that shaped this thesis, we gained greater insight into the area of crawling modern Web applications.
From the experiences gained in this thesis, we bring forth the following ideas to move this work to the next level. From a higher-level perspective, these ideas lie in the areas of improving the time performance and further improving the scalability of the crawling process.

7.4.1 Further Scalability through String Optimization

One of our presumptions about the crawling process was that the memory allocated to states' data is the only obstacle to achieving absolute scalability. That is, we assumed that if memory is not used for state storage, crawling can continue indefinitely. However, from performing the memory analysis, we discovered two important areas of limitation. First, one of the diagnostic tests for the memory consumption of the crawler revealed that a very considerable amount of memory is wasted by having multiple instances of the same string in memory. Further investigation showed that these strings originate predominantly from the HTML attributes and elements extracted from the DOM tree. A possible solution is to employ string interning techniques. The second area of memory optimization follows in the next section.

7.4.2 Further Scalability through Candidate Elements Optimization

Performing the memory analysis, we discovered that the storage of candidate clickable elements is the primary consumer of memory resources. As such, we think the next area worth investigating to improve scalability is developing alternative methods for storing candidate clickables. This can be carried out either by utilizing a graph database or by changing the order in which the elements are stored. That is, it is possible to change the logic so that there is no need to keep the candidate clickables of all stored states in memory.

7.4.3 Scalability Improvement by Text Compression Techniques

As string objects containing HTML elements consume a significant amount of memory in crawling Web applications, we believe that utilizing text compression techniques has the potential to improve the scalability of the crawler. In particular, large DOM strings of the same application may share a considerable portion of content, which creates memory optimization opportunities.

7.4.4 Improving Time Performance Targeting

RQ1 targeted the time performance of the state transition management of the crawling process. As presented in the results, the improvements achieved in the second and the third solutions were very significant. In addition, the lessons learned from the proxy-based solution were very instructive in guiding the directions of the rest of the project, especially which directions not to pursue. However, we also learned from the second part of the project, during the memory analysis, that by utilizing a tool such as the YourKit profiler we might be able to find more significant areas for optimization. That is, although we are not certain, it is possible that a preliminary comparison of the significance of candidate optimization areas, rather than selecting them based solely on our insight into the crawling process, would result in more significant achievements.

Chapter 8
Related Work

Web crawlers have been a major area of research [22-34] since the invention of the Web. Optimizing Web crawlers, as a consequence, has been an attractive research topic as well [35-37]. Here we discuss a number of studies in the literature that focus on optimizing the performance of different Web crawlers.
In addition, since we performed a comparison of graph databases, we also enumerate some studies comparing graph databases.

8.1 Optimizing the Crawling Process

Jourdan et al. propose a strategy for efficient crawling of modern Web applications (which they call rich Internet applications) [38]. The proposed method creates a model for predicting the behavior of modern Web applications and, based on this model, produces an execution plan that is optimized for finding new states in the shortest amount of time. Afterward, the model is updated based on the discovered states if the new states are not compatible with the previous model. The authors compare their technique with other modern Web crawlers and with depth-first and breadth-first crawling strategies.

In another study [39], Jourdan et al. improve their previous work [38] by utilizing statistics gathered during a crawl session to predict optimized strategies for selecting the next part of the application to be crawled. The authors compare the performance of their proposed strategy with the breadth-first strategy, the depth-first strategy, and their previous crawling strategy.

In 1999, Heydon et al. [35] investigated the inefficiencies in their Java Web crawler called Mercator. In particular, they found that they had to refactor their crawler to align the code better with the characteristics of the Java Runtime Environment, such as the way memory is allocated and deallocated on the Java heap, and to avoid excessive synchronization. They investigated synchronization, memory management, and other aspects of Java and achieved performance improvements in the crawler. In another paper [36], they describe their Web crawler implemented in Java by enumerating the components of web crawlers and explaining the alternatives and trade-offs of different approaches. They compare the performance of their crawler with that of Google in terms of the rate of downloading documents, the volume of data downloaded, and the HTTP requests sent.

Another study in the area of optimizing the crawling process of web applications was done by Edwards et al. [37] at the IBM research center. This study focuses on optimizing the Webfountain web crawler. Webfountain is an incremental web crawler, i.e., the crawler updates the versions of the websites it has already crawled when they change.

Edwards' paper has a target that is relatively different from the aim of this thesis. They aim to optimize various competing criteria of crawling the Web, such as freshness, in Webfountain. They propose an adaptive model that optimizes the management of the URL queue and the internal variables of crawling in Webfountain. They utilize simulation and computational models to evaluate their work, but hope to evaluate it on real data once Webfountain is operational.

8.2 Graph Database Applications

In this thesis we compared many graph databases to select the one most suited to our application. Graph databases have either been the target of research or been utilized in many research projects [40-48]. We also found research focusing on comparing different aspects of graph databases [49-52]. A study that is to some extent similar to our comparison addresses the models of graph databases: in a survey, Angles et al. [53] investigate the area of modeling graph databases. They focus on various aspects of modeling, such as query languages, data structures, and integrity constraints.
We have drawn on these studies in performing a comparison of our own for choosing the database best suited to our application.

Chapter 9
Conclusion

Aiming at improving the time consumption and scalability of crawling modern Web applications, we proposed and evaluated our enhancement ideas in this thesis. Availing of efficient DOM monitoring techniques, we focused on providing three alternative solutions for reducing the time consumption of the "state transition management" sub-task of the crawling process. The main idea for improving the time consumption was to bypass unnecessary steps of the state transition management by efficiently tracking alterations made to the DOM. In addition, utilizing a graph database for the storage and retrieval of dynamic Web states, we aimed at increasing the number of states crawled by freeing the memory held by states during a crawling session.

We used Crawljax, a modern Web crawler, as the platform for turning our ideas into concrete implementations so as to evaluate them. As such, we reverse engineered Crawljax to gain insight into the default crawling process and applied our ideas as enhancements to Crawljax.

In order to improve the time consumption of the state transition management, we developed three alternative solutions that were revised incrementally one after another. In all three solutions, an agent for tracking DOM mutations was developed to be installed on the browser side of Web applications. The first solution utilized a proxy for installing the monitoring agent and a Web server for hosting the agent. The proxy was replaced by Crawljax plugins in the second solution, but the Web server was retained for hosting the agent. Finally, in the third solution, agent installation was carried out by Crawljax plugins in the same manner as the second solution, but the Web server was eliminated from the design and the agent was incorporated into the core of the solution.

The crawling process captures dynamic state changes and the navigational paths between the states to form a finite-state machine model called the stateflow graph. Aiming at reducing main memory consumption to increase the number of states crawled, we proposed an alternative solution for storing the stateflow graph data structure in a graph database on secondary memory. We conducted a comparison of graph databases in which Neo4j emerged as the graph database most suited for our application. We implemented a scalable solution utilizing the graph database to free memory from states' data and to provide the rest of the crawling tasks with the freed memory, so that crawling can continue and cover a greater number of states.

The proposed solutions were evaluated by comparing them with the default crawler. Five rounds of experiments were carried out on ten Web applications selected from the Alexa most visited websites. In the experiments on state transition management, the maximum number of states was set to 10 and the time consumptions of the default crawling process and each enhanced version were measured simultaneously. For the state coverage improvement, the output of the scalable crawler (the stateflow graph) was compared with the output of the default version to ascertain the correctness of the proposed solution. The maximum amount of Java heap memory was set to one gigabyte, and the number of states crawled was measured to evaluate the scalability of the proposed solution.
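For reference, such a heap cap corresponds to the JVM's standard maximum-heap option; a generic invocation could look like the following (the jar name is hypothetical, not the exact command used in the experiments):

    java -Xmx1g -jar crawljax-scalability-experiment.jar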
To assess the scalable solution in terms of time consumption, the maximum number of states was set to 100 and the total time consumptions of the scalable solution and the default crawler were measured.

The results of the experiments showed that the proxy-based solution was not able to reduce the time consumption of the state transition management. Having learned that the proxy imposed a significant time overhead on the crawling process, we changed the direction of the thesis. The second solution, revised based on the results of the proxy-based solution, was able to improve the time performance of the state transition management by, on average, 253.34%; that is, the time consumption ratio of default to proposed is 3.53. For the third solution, the ratio of the time consumption of the default version to that of the proposed solution is 2.9748, which means a 197.48% improvement in time performance.

Comparing the stateflow graphs created by the scalable and default crawlers upon crawling deterministic web applications, we observed that the scalable solution crawled correctly, in the exact same manner as the default crawler. The proposed solution improved the scalability by increasing the number of states crawled, on average, by 88.16%. However, this scalability improvement was achieved at the cost of a time overhead of 18.20%.

A memory analysis was conducted on both the scalable solution and the default solution. The main outcomes of this memory analysis were finding two memory leak points in the scalable solution, the resolution of which led to the achieved state coverage improvements, and identifying scalability obstacles that present optimization opportunities for future work.

Finally, as a measure to improve the reproducibility of the experiments, in addition to including the object Web applications in the thesis, the implementation of the scalable solution is accessible as open source software at https://github.com/saltlab/crawljax-graphdb.

Bibliography

[1] J. J. Garrett et al., "Ajax: A new approach to web applications," 2005.

[2] "Oracle java micro edition embedded client customization guide," 2012. http://docs.oracle.com/javame/config/cdc/cdc-opt-impl/ojmeec/1.1/custom/html/tuning.htm, [retrieved: Sep. 25, 2013].

[3] T. Berners-Lee, L. Masinter, M. McCahill, et al., "Uniform resource locators (url)," 1994.

[4] A. Mesbah, A. van Deursen, and S. Lenselink, "Crawling ajax-based web applications through dynamic analysis of user interface state changes," vol. 6, p. 3, ACM, 2012.

[5] D. Goodman, M. Morrison, and B. Eich, JavaScript Bible. John Wiley & Sons, Inc., 2007.

[6] E. Ecma, "262: Ecmascript language specification," ECMA (European Association for Standardizing Information and Communication Systems), pub-ECMA: adr, 1999.

[7] D. Raggett, A. Le Hors, I. Jacobs, et al., "Html 4.01 specification," W3C recommendation, vol. 24, 1999.

[8] "World Wide Web Consortium." http://www.w3.org/, [retrieved: Sep. 23, 2013].

[9] "The WebKit open source project." http://www.webkit.org/, [retrieved: Sep. 10, 2013].

[10] "Introducing FlockDB." https://blog.twitter.com/2010/introducing-flockdb, [retrieved: Apr. 16, 2013].

[11] "Twitter.com." https://twitter.com/, [retrieved: Oct. 7, 2013].

[12] "SeleniumHQ browser automation tool." http://www.seleniumhq.org/, [retrieved: Aug. 12, 2013].

[13] "Web storage W3C recommendation." http://www.w3.org/TR/webstorage/, [retrieved: Sep. 18, 2013].

[14] "Mutation events." https://developer.mozilla.org/en-US/docs/Web/Guide/API/DOM/Events/Mutation_events?redirectlocale=en-, [retrieved: Nov. 21, 2013].

[15] R. Weinstein, "DOM mutation events replacement: The story so far - existing points of consensus," Dec. 2011. http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/0779.html, [retrieved: Nov. 21, 2013].
[16] "Document Object Model (DOM) Level 3 Events specification." http://www.w3.org/TR/DOM-Level-3-Events/#events-mutationevents, [retrieved: Nov. 21, 2013].

[17] "Mutation-summary JavaScript library." https://code.google.com/p/mutation-summary/, [retrieved: Nov. 22, 2013].

[18] "Mutation observers vs. mutation summary." https://code.google.com/p/mutation-summary/wiki/DOMMutationObservers, [retrieved: Nov. 22, 2013].

[19] M. Shapiro, "Structure and encapsulation in distributed systems: the proxy principle," in Proc. 6th Int. Conf. on Distributed Computing Systems (ICDCS), pp. 198-204, May 1986.

[20] "Understanding memory management." http://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/garbage_collect.html, [retrieved: Oct. 1, 2013].

[21] T. Lindholm, F. Yellin, G. Bracha, and A. Buckley, The Java Virtual Machine Specification. 500 Oracle Parkway, Redwood City, California 94065, U.S.A., Java SE 7 ed., 2013.

[22] Y. Guo, K. Li, K. Zhang, and G. Zhang, "Board forum crawling: a web crawling method for web forum," in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 745-748, IEEE Computer Society, 2006.

[23] V. Shkapenyuk and T. Suel, "Design and implementation of a high-performance distributed web crawler," in Data Engineering, 2002. Proceedings. 18th International Conference on, pp. 357-368, IEEE, 2002.

[24] J. Cho and H. Garcia-Molina, "Parallel crawlers," in Proceedings of the 11th international conference on World Wide Web, pp. 124-135, ACM, 2002.

[25] C. C. Aggarwal, F. Al-Garawi, and P. S. Yu, "Intelligent crawling on the world wide web with arbitrary predicates," in Proceedings of the 10th international conference on World Wide Web, pp. 96-105, ACM, 2001.

[26] S. Raghavan and H. Garcia-Molina, "Crawling the hidden web," 2000.

[27] J. L. Wolf, M. S. Squillante, P. Yu, J. Sethuraman, and L. Ozsen, "Optimal crawling strategies for web search engines," in Proceedings of the 11th international conference on World Wide Web, pp. 136-147, ACM, 2002.

[28] G. Pant, P. Srinivasan, and F. Menczer, "Crawling the web," in Web Dynamics, pp. 153-177, Springer, 2004.

[29] M. Ehrig and A. Maedche, "Ontology-focused crawling of web documents," in Proceedings of the 2003 ACM symposium on Applied computing, pp. 1174-1178, ACM, 2003.

[30] S. Chakrabarti, B. E. Dom, and M. H. van den Berg, "System and method for focussed web crawling," July 9, 2002. US Patent 6,418,433.

[31] G. S. Manku, A. Jain, and A. Das Sarma, "Detecting near-duplicates for web crawling," in Proceedings of the 16th international conference on World Wide Web, pp. 141-150, ACM, 2007.

[32] C. Castillo, "Effective web crawling," in ACM SIGIR Forum, vol. 39, pp. 55-56, ACM, 2005.

[33] M. Álvarez, A. Pan, J. Raposo, and J. Hidalgo, "Crawling web pages with support for client-side dynamism," in Advances in Web-Age Information Management, pp. 252-262, Springer, 2006.

[34] S. Chakrabarti, M. Van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific web resource discovery," Computer Networks, vol. 31, no. 11, pp. 1623-1640, 1999.

[35] A. Heydon and M. Najork, "Performance limitations of the java core libraries," in Proceedings of the ACM 1999 conference on Java Grande, pp. 35-41, ACM, 1999.

[36] A. Heydon and M. Najork, "Mercator: A scalable, extensible web crawler," World Wide Web, vol. 2, no. 4, pp. 219-229, 1999.
[37] J. Edwards, K. McCurley, and J. Tomlin, "An adaptive model for optimizing performance of an incremental web crawler," in Proceedings of the 10th international conference on World Wide Web, pp. 106-113, ACM, 2001.

[38] K. Benjamin, G. Von Bochmann, M. E. Dincturk, G.-V. Jourdan, and I. V. Onut, A Strategy for Efficient Crawling of Rich Internet Applications. Springer, 2011.

[39] M. E. Dincturk, S. Choudhary, G. Von Bochmann, G.-V. Jourdan, and I. V. Onut, "A statistical approach for efficient crawling of rich internet applications," in Web Engineering, pp. 362-369, Springer, 2012.

[40] S. Zhang, M. Hu, and J. Yang, "Treepi: A novel graph indexing method," in Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pp. 966-975, IEEE, 2007.

[41] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247-1250, ACM, 2008.

[42] M. Graves, E. R. Bergeman, and C. B. Lawrence, "Graph database systems," Engineering in Medicine and Biology Magazine, IEEE, vol. 14, no. 6, pp. 737-745, 1995.

[43] B. Iordanov, "Hypergraphdb: a generalized graph database," in Web-Age Information Management, pp. 25-36, Springer, 2010.

[44] J. Huan, W. Wang, J. Prins, and J. Yang, "Spin: mining maximal frequent subgraphs from graph databases," in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 581-586, ACM, 2004.

[45] C. Vicknair, M. Macias, Z. Zhao, X. Nan, Y. Chen, and D. Wilkins, "A comparison of a graph database and a relational database: a data provenance perspective," in Proceedings of the 48th annual Southeast regional conference, p. 42, ACM, 2010.

[46] R. Angles and C. Gutierrez, "Querying rdf data from a graph database perspective," in The Semantic Web: Research and Applications, pp. 346-360, Springer, 2005.

[47] D. W. Williams, J. Huan, and W. Wang, "Graph database indexing using structured graph decomposition," in Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pp. 976-985, IEEE, 2007.

[48] K. Riesen and H. Bunke, "Iam graph database repository for graph based pattern recognition and machine learning," in Structural, Syntactic, and Statistical Pattern Recognition, pp. 287-297, Springer, 2008.

[49] S. Jouili and V. Vansteenberghe, "An empirical comparison of graph databases," in Social Computing (SocialCom), 2013 International Conference on, pp. 708-715, IEEE, 2013.

[50] M. Buerli and C. P. S. L. Obispo, "The current state of graph databases," 2012.

[51] A. Popescu, "A comparison of 7 graph databases," 2013. http://nosql.mypopescu.com/post/40759505554/a-comparison-of-7-graph-databases, [retrieved: Feb. 2, 2013].

[52] M. C. Alekh Jindal, "Benchmarking graph databases," 2013. http://istc-bigdata.org/index.php/benchmarking-graph-databases, [retrieved: Jan. 29, 2013].

[53] R. Angles and C. Gutierrez, "Survey of graph database models," ACM Computing Surveys (CSUR), vol. 40, no. 1, p. 1, 2008.

Appendix A
Deterministic Candidates

In this appendix we list all the object Web applications that we crawled to find deterministic applications. A deterministic application was required for forming an oracle for the correctness experiments.
The list is as follows:

http://www.heatcityreview.com/somervillenews.htm
http://www.martinkrenn.net
http://www.kastanova.nl
http://www.al-awda.org
http://thething.it
http://www.engruppo.com
http://www.turbopatents.com
http://www.ece.ubc.ca
http://www.ubc.ca
http://www.cultofmac.com
http://www.python.org
http://metrics.codahale.com
http://www.uss.de
http://www.facebook.com
https://www.google.ca
http://www.bing.com
https://mail.google.com
https://github.com
https://code.google.com/p/guava-libraries
http://www.vogella.com
http://www.cacno.org
http://sedo.co.uk/search/details.php4?domain=thestart.net
http://www.iltasanomat.fi
http://www.fairvote.org
http://www.beehive.nu
http://www.littlewhitedog.com
http://www.provincetown.com
http://www.expedia.ca/?semcid=ni.ask.12908&kword=DEFAULT.-NNNN.kid&rfrr=Redirect.From.www.expedia.com/Home.htm
http://www.airtoons.com
http://www.loco.pl/pl
http://www.grudge-match.com/current.html
http://www.kastanova.nl
http://www.acces-local.com/wordpress
http://www.sfchronicle.com
http://www.eldritch.com
http://www.eldritch.com
http://ruyguy15.150m.com
http://www.bghs.org
http://www.axis-of-aevil.net
http://www.infoshop.org
http://www.introducingmonday.co.uk
http://www.linuxdevcenter.com/pub/a/linux/2000/06/29/hdparm.html
http://www.twentysevenrecords.com
http://www.hallwalls.org
http://www.justfood.org
http://bshigley.tumblr.com
http://www.ottawacitizen.com/index.html
http://emeraldforestseattle.com/forums/ubbthreads
http://www.cancernews.com/default2.asp
http://www.usablenet.com
http://www.viz.com/naruto
http://www.viz.com/naruto
http://www.needcoffee.com
http://www.libraryplanet.com
http://www.coversproject.com
http://windows.microsoft.com/en-us/windows/support#top-solutions=windows-8
http://buffalo.bisons.milb.com/index.jsp?sid=t422
http://www.b3ta.com
http://antiadvertisingagency.com
http://www.sfbike.org
http://www.grimemonster.com
http://www.threadless.com
https://wilwheaton.net
http://www.hasbrouck.org
http://www.uberbin.net
http://www.gaijinagogo.com
http://lalibertad.com.co/dia/p0.html
http://www.wrightfield.com
http://www.modernhumorist.com
http://www.bcdb.com
http://desktopgaming.com
http://www.metalbite.com
http://nowyoulistentomelittlemissy.blogspot.ca
http://www.cimgf.com
http://www.paulmadonna.com
http://dawnm.com
http://typicalculture.com/wordpress
http://www.vegweb.com
http://www.newdream.org
http://www.isc.org/downloads/BIND/
http://hunyyoung.com
http://home.planet.nl/ mooij321
http://www.antique-hangups.com
http://www.project451.com//
http://lmuwnmd.wpengine.com/wp-signup.php?new=www.techblog.com//
http://c-level.org
http://rprogress.org/index.htm
http://www.math.mcgill.ca
