
Hidden-Web Induced by Client-Side Scripting: An Empirical Study

by

Zahra Behfarshad

B.Sc., Islamic Azad University Tehran Central Branch, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

May 2014

© Zahra Behfarshad, 2014

Abstract

Client-side JavaScript is increasingly used for enhancing web application functionality, interactivity, and responsiveness. Through the execution of JavaScript code in browsers, the DOM tree representing a webpage at runtime can be incrementally updated without requiring a URL change. This dynamically updated content is hidden from general search engines. We present the first empirical study on measuring and characterizing the hidden-web induced as a result of client-side JavaScript execution. Our study reveals that this type of hidden-web content is prevalent in online web applications today: of the 500 websites we analyzed, 95% contain client-side hidden-web content. On those websites that contain client-side hidden-web content, (1) on average, 62% of the web states are hidden, (2) per hidden state, an average of 19 kilobytes of data is hidden, of which 0.6 kilobytes is textual content, (3) the DIV element is the most common clickable element used (61%) to initiate this type of hidden-web state transition, and (4) on average 25 minutes is required to dynamically crawl 50 DOM states. Further, our study indicates that there is a correlation between DOM tree size and hidden-web content, but no correlation exists between the amount of JavaScript code and client-side hidden-web content.

Preface

This thesis presents an empirical study of client-side hidden-web content conducted by the author under the supervision of Dr. Ali Mesbah. I was responsible for implementing the tool, collecting URLs, running the experiments, evaluating and analyzing the results, and writing the manuscript. My supervisor guided me in creating the experimental methodology and analyzing the results, as well as editing and writing portions of the paper.

The results of this study were published as a full paper in the Proceedings of the International Conference on Web Engineering (ICWE) [4] in 2013. We are also proud that our paper received the Best Paper Award at ICWE 2013.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 Objective
  1.2 Contributions
  1.3 Thesis Organization
2 Background and Motivation
  2.1 The Hidden-Web
  2.2 JavaScript Background
  2.3 Document Object Model (DOM)
  2.4 Hidden-content Example
  2.5 Client-Side Hidden-Web Content
3 Related Work
  3.1 Crawling the Hidden-Web
  3.2 Measuring the Hidden-Web
4 Methodology
  4.1 Research Questions
  4.2 Experimental Design
  4.3 Experimental Objects
5 Client-Side Hidden-Web Analysis
  5.1 Event-Driven Dynamic Crawling
    5.1.1 State Exploration
    5.1.2 Crawling Configuration
  5.2 Classification
  5.3 Characterization Analysis
    5.3.1 Hidden-Web Quantity
    5.3.2 Clickable Types
    5.3.3 Correlations
  5.4 Tool Implementation
6 Results
  6.1 Pervasiveness (RQ1)
  6.2 Quantity (RQ2)
  6.3 Induction (RQ3)
  6.4 Correlations (RQ4)
7 Discussion
  7.1 Threats to Validity
    7.1.1 Internal Validity
    7.1.2 External Validity
    7.1.3 Construct Validity
    7.1.4 Conclusion Validity
  7.2 Implications
8 Conclusions and Future Work
Bibliography

List of Tables

Table 6.1 Descriptive statistics of the percentage of client-side hidden-web states in ALEXA, RANDOM, and TOTAL.
Table 6.2 Hidden-Web Analysis Results. The first 10 are from ALEXA, and the remaining 10 from RANDOM.
Table 6.3 Descriptive statistics of the average hidden-web content for all states and per state.
Table 6.4 Descriptive statistics of the DOM size in ALEXA, RANDOM, and TOTAL.
Table 6.5 Descriptive statistics of the JavaScript custom code size in ALEXA, RANDOM, and TOTAL.

List of Figures

Figure 2.1 JavaScript code for updating the DOM after a click event.
Figure 2.2 The initial DOM state.
Figure 2.3 The updated DOM tree after clicking on 'Update!'.
Figure 5.1 Overview of our client-side hidden-web analysis.
Figure 6.1 Pie chart representing the percentage of websites exhibiting client-side hidden-web content from all the 500 websites.
Figure 6.2 Box plots of the percentage of client-side hidden-web states in ALEXA, RANDOM, and TOTAL.
Figure 6.3 Pie chart displaying the use of different clickables throughout the 500 websites.
Figure 6.4 Pie chart showing hidden-web percentage behind different types of clickables. 'A INVIS' represents anchor tags without a (valid) URL. 'IMG INVIS' represents IMG elements not embedded in an anchor tag with a (valid) URL.
Figure 6.5 Bar chart of Alexa categories versus hidden-web state percentage.
Figure 6.6 Scatter plots of the DOM size versus hidden-web state percentage. 'r' represents the Spearman correlation coefficient and 'p' is the p-value.
Figure 6.7 Scatter plots of the DOM size versus hidden-web content. 'r' represents the Spearman correlation coefficient and 'p' is the p-value.
Figure 6.8 Scatter plots of the JavaScript size versus hidden-web state percentage. 'r' represents the Spearman correlation coefficient and 'p' is the p-value.
Figure 6.9 Scatter plots of the JavaScript size versus hidden-web content. 'r' represents the Spearman correlation coefficient and 'p' is the p-value.

Acknowledgments

I would like to thank my advisor, Dr. Ali Mesbah, for giving me the opportunity to be a part of his lab. His knowledge, motivation, inspiration, support and kindness throughout my years of studying always encouraged me to think more critically and to approach my work more professionally. Under his honourable supervision I learnt many things, and I appreciate it. I am proud to say that these years were among the best and most memorable of my life.

I would also like to thank my family, especially my parents, for their unlimited support and love. Without them, pursuing this degree would have been very difficult and cumbersome. My brothers, Alireza and Mohammad, also helped me stay happy and keep my hopes up.

Finally, I would like to thank a very dear friend, Shabnam Mirshokraie, for all of her support, kindness and guidance. She always motivated me to follow my dreams and never give up.

Last but not least, I would also like to thank Prof. Philippe Kruchten and Prof. Karthik Pattabiraman for accepting to be a part of my defence committee.

Dedication

To my family and friends

Chapter 1: Introduction

General web search engines cover only a portion of the web, called the visible or indexable web, which consists of the set of web pages reachable purely by following URL-based links.

There is, however, a large body of valuable web content that is not accessible by simply following hypertext links. Well-known examples include dynamic server-side content behind web forms [3, 29], reachable through application-specific queries. This portion of the web, not reachable through search engines, is generally referred to as the invisible or hidden web, which, in 2001, was estimated to be 500 times larger than the visible web [5].
More recently, form-based hidden-web content has been estimated at several millions of pages [3, 17].

With the wide adoption of client-side programming languages such as JavaScript and of AJAX techniques to create responsive web applications, there is a new type of hidden-web that is growing rapidly. Although there has been extensive research on detecting [3, 20, 29] and measuring [5, 15] hidden-web content behind web forms, the hidden-web induced as a result of client-side scripting has gained limited attention so far.

JavaScript is the dominant language for implementing dynamic web applications. Today, as many as 97 of the top 100 most visited websites [1] have client-side JavaScript [31], often consisting of thousands of lines of code per application. JavaScript is increasingly used for offloading core functionality to the client-side and achieving rich web interfaces. In most Web 2.0 applications, JavaScript code extensively interacts with and incrementally updates the Document Object Model (DOM) at runtime in order to present new state changes seamlessly. Changes made dynamically to the structure, contents or styles of DOM elements are directly manifested in the browser's display. This event-based style of interaction is substantially different from the traditional URL-based page transitions through hyperlinks, where the entire DOM is repopulated with a new HTML page from the server for every user-initiated state change.

1.1 Objective

Our goal is to measure the pervasiveness and characterize the nature of hidden-web content induced by client-side JavaScript in today's web applications. Therefore, the main focus of our study is to understand how client-side scripting languages, especially JavaScript, contribute to hidden-web content by measuring the percentage of hidden-web states in websites. In addition, we attempt to identify any correlation between the hidden-web content and the DOM size, as well as the custom JavaScript code size.

For simplicity, we will refer to this type of hidden-web content as client-side hidden-web throughout the thesis. To the best of our knowledge, we are the first to conduct an empirical study on this topic.

1.2 Contributions

Our study makes the following main contributions:

• A systematic methodology and tool, called JAVIS, to automatically analyze and measure client-side hidden-web content;

• An empirical study, conducted on 500 online websites, pointing to the ubiquity and pervasiveness of this type of hidden-web content in today's websites. Our results show that 95% of the websites we analyzed contain some degree of client-side hidden-web content. On those websites with client-side hidden-web content, on average (1) 62% of the crawled web states are hidden, (2) around 19 kilobytes of DOM content is hidden per hidden state, of which 0.6 kilobytes is textual content, and (3) the DIV element is the most commonly used (61%) clickable type contributing to client-side hidden-web content;

• A discussion of the implications of our empirical findings.
Our study indicates that there is a possible correlation between the size of the DOM tree and the hidden-web content, but surprisingly, no strong correlation exists between the amount of JavaScript code and client-side hidden-web content.

The results of our study were published as a full paper [4] in the Proceedings of the International Conference on Web Engineering (ICWE) in 2013, where it received the Best Paper Award.

1.3 Thesis Organization

In Chapter 2, we provide background information regarding web applications, particularly with respect to the use of client-side JavaScript, along with the motivation for conducting this study. Chapter 3 discusses the related work in this area of research. Chapter 4 describes in detail the experimental methodology used to measure the hidden-web content induced by JavaScript. Chapter 5 presents the method used to pursue our research questions, along with how we analyze the collected data to perform the evaluation. Chapter 6 presents the results acquired from this study and answers the research questions. Chapter 7 discusses the implications our results have for web application programmers, testers, and tool developers, and some of the validity threats in our study. Finally, Chapter 8 concludes our work and presents future research directions.

Chapter 2: Background and Motivation

2.1 The Hidden-Web

With the rapid growth of data embedded inside web applications, the Internet has become a main source of information today. To retrieve relevant information from billions of existing web applications, search engines require a means to crawl and index the data efficiently and effectively.

In practice, search engines employ what are called crawlers or spiders. These crawlers automatically inspect web pages and create a copy of the visited pages. These actions are repeated regularly to index useful content within the web applications or simply to update the resources in order to facilitate future searching. Therefore, search engines can only recall data within pages that were previously visited by crawlers.

A question regarding search engines and indexing is: how much of the Web's content can be searched and retrieved by current search engines? In order to answer this question, we first explain two divisions of the Web: the visible Web and the hidden Web.

The visible Web (also known as the surface or indexable Web) [5] is the portion of the Web whose content search engines are capable of indexing and retrieving. In contrast, the hidden Web (also known as the deep/invisible Web, deepnet, or darknet) [5, 29] is the part whose content conventional search engines are not able to index, and which is thus invisible to users of search engines.

There are eight major causes for the deep Web: dynamic content, unlinked content, private websites, contextual Web, limited access content, scripted content, non-HTML/text content, and text content using the Gopher/FTP protocols. Each of these causes is described in the following:

Dynamic Content. Dynamic content refers to the dynamic generation of information that is returned in response to a submitted query or accessed only through a form. This type of content is associated with forms, databases, and the data stored within the databases. Once the user completes a form and submits it, related information corresponding to the search query is retrieved from the databases and presented.
Today, many websites use databases and forms to benefit from the advantages they provide for both users and website owners. A web application might be composed of one or many databases, either entirely connected or independent. Either way, it is very difficult for crawlers to gain access to the databases and extract information. Thus, the information stored inside databases is indeed aggregated into the hidden-web.

Unlinked Content. Any information staged within web pages that are not linked to by other pages becomes hidden from the user. Clearly, if a web page is not linked to the other web pages of a website, crawlers cannot gain access to it, since there is no connection between the web pages through which to continue the crawl.

Private Websites. Private websites are websites that require registration and login (password-protected resources). Today, many websites require their users to acquire a username and password to be able to obtain access after signing in. This can be seen in forums and social networks where people communicate with one another. One of the goals of requiring registration and login is to protect the users' information and responses from adversaries. Although this is reasonable, it prevents crawlers from reaching the content within these websites, and therefore the data shifts towards the hidden-web.

Contextual Web. This refers to pages whose content varies under different access contexts. Based on the access type, for example the location of the user or the IP address, different content is presented. Since the web page might contain and display a variety of details and data, search engines will only be able to index part of the data, while the rest grows inside the hidden-web.

Limited Access Content. Limited access content refers to sites that limit access to their pages in a technical way, that is, by any means that prohibit search engines from inspecting the web pages, such as CAPTCHAs or the robots exclusion standard. CAPTCHAs are usually used to detect whether the visitor is a real user or simply a robot. To this end, a random statement (either a question or a word) is generated each time, which the user has to type. Since the statements are generated randomly, robots cannot reply, and thus the website will not be browsed. Regarding the exclusion standard, web developers have the ability to block crawlers from inspecting their web applications.

Scripted Content. Scripted content is any type of information that is produced by scripting languages, such as the most popular scripting language, JavaScript. This content is usually generated dynamically, and since search engines do not execute JavaScript code, it becomes hidden content.

Non-HTML/Text Content. Non-HTML/text content is textual content encoded within multimedia. Obviously, any text embedded within videos, audio and images is not searchable, since it is not crawled in the first place.

Although there are many reasons behind hidden-web content, as mentioned above, only a few of them have been studied. These studies [3, 11, 12, 20, 22, 29, 30] mostly focus on the portion of hidden-web content behind forms and databases, and the other causes have gained little attention. In this study, we focus on the content dynamically generated by scripting languages, and in particular by JavaScript.
As mentioned before, our goal is to understand how JavaScript contributes to and correlates with hidden content on the web.

2.2 JavaScript Background

With the wide adoption of client-side programming languages such as JavaScript and of AJAX techniques to create responsive web applications, there is a new type of hidden-web that is growing rapidly.

JavaScript plays an important role when interacting with web applications. It is the language of the browser, and by writing JavaScript code we can easily generate dynamic content and thus enhance the responsiveness and interactivity of web applications.

Although JavaScript has disadvantages, such as increasing the vulnerability of web applications and browsers, the advantages outweigh them. One of the most powerful abilities it provides is the capability of easily modifying the DOM tree.

By using a JavaScript function, the developer can gain access to any DOM element within the web page. Once the element is obtained, any sort of alteration can be carried out, depending on the purpose. These modifications can be of any kind, such as removing nodes or appending new nodes to an element, modifying the textual content of the element, or altering the attributes of the element or their values. A clear example of DOM modification is provided in Section 2.4.

The JavaScript engine interprets and executes the relevant code, and the DOM tree is updated. Once the update is complete, the browser interprets the new DOM tree and the web page is re-rendered containing the new data.

2.3 Document Object Model (DOM)

The fundamental building blocks of a web application are its HTML elements. These elements are the crucial components for a web application to exist. There is no limitation on how and where to use the elements; their use only varies with the design and functionality of the web page.

The DOM is a model representing the structure of the HTML elements. It describes how the elements are connected with each other and how one can gain access to them. By referring to an element within a web page by its appropriate name, the developer can obtain its attributes, their values, and the text it holds. In the DOM tree, each element is referred to as a node. A node can either be both a parent node and a child node, or only a child node. The DOM tree also represents attribute nodes. In other words, not only the elements and text are added to the DOM tree, but the attributes of the elements can be appended as well.

One of the powerful advantages the DOM tree provides for the developer is the ability to walk the DOM tree, identify the element she desires, and modify it. This can be done easily by simply obtaining a node. However, in order to alter the DOM tree, specific APIs (Application Programming Interfaces) are required. These APIs are used from scripting languages such as JavaScript, for example getElementById('news') in the example below.
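To make these kinds of modifications concrete, the following minimal sketch uses the standard DOM API directly. It is an illustration added here for clarity; the element ID 'news' and the inserted text are hypothetical, not taken from any of the subject websites.

```javascript
// Obtain a reference to an existing element by its ID.
var container = document.getElementById('news');

// Append a new child node carrying fresh textual content.
var item = document.createElement('p');
item.textContent = 'A new headline, injected without any URL change';
container.appendChild(item);

// Alter an attribute value on the same element.
container.setAttribute('class', 'updated');
```

None of these changes is reflected in the page URL; they exist only in the browser's runtime DOM tree, which is precisely why such content can end up hidden from search engines.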
2.4 Hidden-content Example

We present a simple example of how JavaScript code can induce hidden-web content by dynamically changing the DOM tree. Figure 2.1 depicts a JavaScript code snippet using the popular jQuery library (http://jquery.com). Figure 2.2 illustrates the initial state of the DOM before any modification has occurred.

```javascript
1  $(document).ready(function() {
2    $('div.update').click(function() {
3      var updateID = $(this).attr('rel');
4      $.get('/news/', { ref: updateID },
5        function(data) {
6          $('#' + updateID + 'Container').append(data);
7        });
8    });
9  });
```

Figure 2.1: JavaScript code for updating the DOM after a click event.

```html
<body>
  <h1>Sports News</h1>
  <p><span id="sportsContainer"></span></p>
  <div class="update" rel="sports">Update!</div>
</body>
```

Figure 2.2: The initial DOM state.

Once the page is loaded (line 1 in Figure 2.1), the JavaScript code attaches an onclick event-listener to the DIV DOM element with class attribute 'update' (line 2). When a user clicks on this DIV element, the anonymous function associated with the event-listener is executed (lines 2–8). The function then sends an asynchronous call to the server (line 4), passing a parameter read from the DIV element (i.e., 'sports') (line 3). On the callback, the response content from the server is injected into the DOM element with ID 'sportsContainer' (line 6).

The resulting updated DOM state is shown in Figure 2.3. All the data retrieved and injected into the DOM this way will be hidden content, as it is not indexed by search engines. Although the effect of client-side scripting on the hidden-web is clear, there is currently a lack of comprehensive investigation and empirical data in this area.

```html
<body>
  <h1>Sports News</h1>
  <p><span id="sportsContainer">
    <h3>US GP: Vettel fastest in Austin second practice</h3>
    <p>Vettel produced an ominous performance</p>
  </span></p>
  <div class="update" rel="sports">Update!</div>
</body>
```

Figure 2.3: The updated DOM tree after clicking on 'Update!'.

2.5 Client-Side Hidden-Web Content

Client-side scripting empowers dynamic and responsive web interfaces in today's web applications. The most widely used language for client-side scripting is JavaScript, which is an event-driven, dynamic and loosely typed interpreted programming language.

JavaScript is supported by all modern web browsers across multiple platforms including desktops, game consoles, tablets, and smartphones. Through JavaScript, the client-side runtime DOM tree of a web application can be dynamically updated with new structure and content. These updates are commonly initiated through event-listeners, AJAX callbacks, and timeouts. The new content, either originating from the server-side or created on the client-side, is then injected into the DOM through JavaScript to represent the new state of the application.

Although DOM-based manipulation through JavaScript increases the responsiveness of web applications, such dynamically injected content ends up in the hidden-web portion of the web. The main reason is that crawling such dynamic content is fundamentally more challenging and costly than crawling classical multi-page web applications, where states are explicit and correspond to pages that have a unique URL assigned to them. Moreover, JavaScript is the basis for AJAX-based web applications, which have become very popular in the past decade.

AJAX (Asynchronous JavaScript and XML) is an umbrella term for a set of technologies that allow asynchronous server communication without requesting a completely new version of the page.
Some well-known representative examples include Gmail and Google Docs.

In modern web applications, however, the client-side state is determined dynamically through changes in the DOM that are only visible after executing the corresponding JavaScript code. The major search giants currently have little or no support for dynamic analysis of JavaScript code, due to scalability and security issues. They basically crawl and extract hypertext links and index the resulting HTML code recursively.

Chapter 3: Related Work

In order to discuss previous work in more detail, we classify the related work into two major categories: (1) crawling the hidden-web, and (2) measuring the hidden-web.

3.1 Crawling the Hidden-Web

Crawling techniques have been studied since the advent of the Web itself [6, 7, 9, 16, 28]. Web crawlers find and index millions of HTML pages daily by searching for hyperlinks. Yet a large amount of data is hidden behind web queries, and therefore extensive research has been conducted towards finding and analyzing the hidden-web, also called the deep-web, behind web forms [3, 11, 12, 20, 22, 29, 30].

The main focus in this line of research is on exploring ways of detecting query interfaces and accessing the content in online databases, which is usually behind HTML forms. HiWE [29] is one of the very first proposed hidden-web crawlers, which tries a combination of different query values for HTML forms.

Ntoulas et al. [27] focus on an effective hidden-web crawler that discovers and downloads pages by simply generating queries based on pre-defined search policies. Barbosa et al. [3] propose an adaptive crawling strategy to locate online databases by searching extensively through web forms while avoiding crawling any unrelated pages.

Carvalho et al. [12] present a method, called SmartCrawl, for retrieving information behind HTML forms by automatically generating agents that fill out forms. Similarly, Palmieri et al. [20] automatically generate fetching agents to identify hidden-web pages by filling out HTML forms.

Liakos and Ntoulas [21] have recently proposed a topic-sensitive hidden-web crawling approach. Khare et al. [18] provide a comprehensive survey of the research work on search interfaces and keywords.

This line of research is merely concerned with server-side hidden-web content (i.e., in databases). In contrast, exploring the hidden-web induced as a result of client-side scripting has gained very little attention so far.

Alvarez et al. [2] discussed the importance and challenges of crawling the client-side hidden-web. Mesbah et al. [24, 25] were among the first to propose an automated crawler, called CRAWLJAX, for AJAX-based web applications. CRAWLJAX automates client-side hidden-web crawling by firing events and analyzing DOM changes to recursively detect new states. Duda et al. [13] presented how DOM states can be indexed. The authors proposed a crawling and indexing algorithm for client-side state changes.

3.2 Measuring the Hidden-Web

Researchers have reported their results of measuring the hidden-web behind forms. In 2001, Bergman [5] reported a study indicating that the hidden-web was about 500 times larger than the visible web.

In 2004, Chang et al. [8] measured hidden-web content in online databases using a random IP-sampling approach, and found that the majority of the data in such databases is structured.

In 2007, He et al. [15] conducted a study using an overlap analysis technique between some of the most common search engines, such as Yahoo!, Google, and MSN, and discovered that 43,000–96,000 deep websites existed.
They presented an informal estimate of 7,500 terabytes of hidden data, 500 times larger than the visible web, which supported the earlier results by Bergman. All this related work focuses on measuring the server-side hidden-web behind forms.

To the best of our knowledge, we are the first to study and measure the client-side hidden-web.

Chapter 4: Methodology

In this chapter, the research questions and the fundamentals of the methodology used in this study are each explained separately in the following sections. It should be noted that the approach used for analyzing the experimental subjects is described in full in the next chapter.

4.1 Research Questions

The main goal of our empirical study is to measure the pervasiveness and characterize the nature of hidden-web content induced by client-side scripting. By understanding these characteristics, we gain knowledge of how much information is hidden, and whether it is possible to reduce this amount of hidden content and turn it into visible content. It should be noted that we only focus on the onclick event type in this study, since it is the most widely used among web applications.

Our research questions are formulated as follows:

RQ1: How pervasive is client-side hidden-web in today's web applications?
RQ2: How much content is typically hidden due to client-side scripting?
RQ3: Which clickable elements contribute most to client-side hidden-web content?
RQ4: Are there any correlations between the degree of client-side hidden-web and a web application's characteristics?

4.2 Experimental Design

To investigate the pervasiveness of hidden content due to client-side scripting (RQ1), we examine all 500 websites and count the percentage of websites that exhibit client-side hidden-web content. In addition, for each of those websites that does contain hidden-web content, we measure what percentage of the crawled states is hidden.

To measure the amount of content that is hidden (RQ2), we compute the total and average amounts of hidden content in terms of the differences between each hidden state and its previous state, regardless of whether the previous state is hidden or visible. We consider two cases: hidden content containing only the textual differences, and hidden content comprising both textual and DOM differences.

Regarding RQ3, we first analyze the distribution of the clickable elements used across all web applications. Furthermore, we classify the clickable elements whose exercise results in hidden states. In other words, we assess what types of DOM elements are commonly used in practice by web developers that induce this type of dynamic JavaScript-driven state change.

In order to answer RQ4, we analyze correlations between the client-side hidden-web content and the average DOM size, as well as the custom JavaScript code size, of each website examined. In the next chapter, the technical details of our analysis approach are explained.

4.3 Experimental Objects

In this study, we analyze 500 unique websites in total. To obtain a representative pool of websites, similar to other researchers [19, 31], we first select 500 websites from Alexa's Top Sites [1] (henceforth referred to as ALEXA). However, ALEXA contains websites that are exactly the same (same domain) but hosted in different countries. Therefore, for multiple instances of the same domain on Alexa's top list (e.g., www.google.com, www.google.fr), we only include and count one instance.
This leads to a total of 400 objects in our list.

Since these 400 websites are all selected from ALEXA based on popularity, we additionally gather another 100 random websites using the Yahoo! random link generator (henceforth referred to as RANDOM), which has also been used in other studies [10, 23]. The purpose of this is to reach a more generalizable conclusion when evaluating the results.

It should be noted that all 500 websites (henceforth referred to as TOTAL) were crawled and analyzed throughout February 2013.

Chapter 5: Client-Side Hidden-Web Analysis

Figure 5.1 depicts our client-side hidden-web content analysis technique, which is composed of three main steps: (1) dynamically crawling each given website, (2) classifying the detected state changes into visible and hidden categories, and (3) conducting characterization analyses of the hidden states. Each step is described in the subsequent subsections.

[Figure 5.1: Overview of our client-side hidden-web analysis: (1) crawl each ALEXA/RANDOM URL in a browser, generating events and analyzing DOM changes to build a state-flow graph; (2) classify the resulting states as visible or hidden; (3) characterize the hidden states.]

5.1 Event-Driven Dynamic Crawling

5.1.1 State Exploration

To automate the crawling step, we use and extend our AJAX crawler, CRAWLJAX [25]; our approach for automatically exploring a web application's state space is based on this work. CRAWLJAX is a crawler capable of automatically exploring JavaScript-induced DOM state changes through an event-driven dynamic crawling technique. It exercises client-side code, and detects and executes clickables that lead to the various dynamic states of Web 2.0 AJAX-based web applications. By inserting random or user-specified data, firing events on the web elements, and analyzing the effects on the dynamic DOM tree in a real browser before and after each event, the crawler incrementally builds a state-flow graph (SFG) [25] capturing the client-side states and the possible event-based transitions between them. This state-flow graph is defined as follows:

A state-flow graph SFG for an AJAX-based website A is a labelled, directed graph, denoted by a 4-tuple < r, V, E, L > where:

r is the root node (called Index) representing the initial state when A has been fully loaded into the browser.
V is a set of vertices representing the states. Each v ∈ V represents a runtime DOM state in A.
E is a set of (directed) edges between vertices. Each (v1, v2) ∈ E represents a clickable c connecting two states if and only if state v2 is reached by executing c in state v1.
L is a labelling function that assigns a label, from a set of event types and DOM element properties, to each edge.

An SFG can have multi-edges and be cyclic.
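As an illustration of what the crawler records, the sketch below shows one possible in-memory representation of such a state-flow graph. It is a simplified sketch for exposition only, not the actual CRAWLJAX data structure, and all field names are hypothetical.

```javascript
// A minimal state-flow graph: vertices are runtime DOM states, edges are clickables.
function StateFlowGraph(indexDom) {
  this.root = { id: 'index', dom: indexDom };  // r: the Index state
  this.states = [this.root];                   // V: the set of DOM states
  this.edges = [];                             // E: directed edges; multi-edges and cycles allowed
}

StateFlowGraph.prototype.addState = function (id, dom) {
  var state = { id: id, dom: dom };
  this.states.push(state);
  return state;
};

// L: each edge carries a label made of the event type and the clickable's properties.
StateFlowGraph.prototype.addEdge = function (from, to, eventType, element) {
  this.edges.push({ from: from.id, to: to.id, event: eventType, element: element });
};
```

The classification step described in Section 5.2 only needs to walk this graph and inspect the labels on the incoming edges of each state.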
CRAWLJAX is also capable of crawling traditional URL-based websites. It is fully configurable in terms of the types of elements that should be examined or ignored during the crawling process. For more details about the architecture, algorithms or capabilities of CRAWLJAX, the interested reader is referred to [25, 26] and to http://crawljax.com.

5.1.2 Crawling Configuration

We have extended, modified, and configured CRAWLJAX for this study as follows:

Maximum states. To constrain the state space and still acquire a representative sample for our analysis, we define an upper limit on the number of states to crawl for each website, namely 50 unique DOM states. Setting the maximum number of DOM states to 50 is based on a pilot study conducted beforehand. Since it was only a pilot study, 10 random websites with different rankings were selected and crawled for four rounds. In the first round the maximum-state configuration was set to 25 states, and 25 more states were added in each subsequent round, so each website was crawled with 25, 50, 75 and 100 states. It should be noted, however, that not all websites yielded 75 or 100 unique DOM states. For those that did include 100 distinct states, the hidden-web state percentage was no different from the hidden-web state percentage of the crawls with 50 states. Therefore, by analyzing the hidden-web state percentages and the numbers of states across all four rounds, we concluded that the maximum number of states should be set to 50.

Crawling depth. Similar to other studies [15], we set the crawling depth to 3.

Candidate clickables. Traditionally, forms and anchor tags pointing to valid URLs were the only clickables capable of changing the state (i.e., by retrieving a new HTML page from the server after the click). However, in modern websites, web developers can potentially make any HTML element act as a clickable by attaching an event-listener (e.g., onclick) to that element. Such clickables are capable of initiating DOM mutations through JavaScript code. In our analysis, we include the most commonly used clickable elements, namely: A, DIV, SPAN, IMG, INPUT and BUTTON.

Event type. We specify the event type to be click. This means the crawler will generate click events on DOM elements that are spotted as candidate clickables, i.e., elements potentially capable of changing the DOM state.

Randomized crawling. In order to get a simple random sample, we randomize the crawling behaviour in terms of selecting the next candidate clickable for exploration.

Once the tool is configured, we automatically select and crawl each experimental object, and save the resulting state-flow graph containing the detected states (DOM trees) and transitional edges (clickables).
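For clarity, the settings above can be summarized in a single configuration object, sketched below. This is a schematic illustration of the chosen parameters only; it is not the actual CRAWLJAX or JAVIS configuration API, and the property names are ours.

```javascript
// Schematic crawl configuration mirroring the settings used in this study.
var crawlConfig = {
  maxStates: 50,                 // upper limit on unique DOM states per website
  maxDepth: 3,                   // crawling depth
  candidateClickables: ['a', 'div', 'span', 'img', 'input', 'button'],
  eventType: 'click',            // only onclick events are fired on candidate clickables
  randomizeClickableOrder: true  // select the next candidate clickable at random
};
```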
5.2 Classification

As shown in Figure 5.1, for each website crawled, we classify the detected states into two categories: visible and hidden. Our client-side hidden-web analysis is largely based on the following two assumptions:

1. A valid URL-based state transition can be crawled and indexed by search engines and, therefore, it is visible;

2. A non-URL-based state transition is not crawled nor indexed by search engines and thus it ends up in the hidden-web; for instance, the DOM update presented in Figure 2.3, resulting from clicking on the DIV element of Figure 2.2, is hidden.

To classify the crawled states, we traverse the inferred state-flow graph of each website. For each state, we analyze all the incoming edges (i.e., clickables). If none of the incoming edges is a valid URL-based transition, we consider that state to be a hidden state. Otherwise, it is visible.

Each edge contains information about the type of clickable element that caused a state change. Our classification uses that information to decide which resulting states are hidden, as follows (a small sketch of this logic is shown after the list):

Anchor tag (A). The anchor tag can produce both visible and hidden states, depending on the presence and URL validity of the value of its HREF attribute. For instance, clicking on <A HREF='www.eg.com/news/'> results in a visible state, whereas <A HREF='#' onclick='updateNews();'> can produce a hidden state.

IMG. The image tag is also interesting, since it can result in a visible state when embodied in an anchor tag with a valid URL. For every edge of IMG type, we retrieve the parent element from the corresponding DOM state. If the parent element is an anchor tag with a valid URL, then we categorize the resulting state as visible; otherwise the state is hidden.

Other element types. By definition, DIV, SPAN, INPUT, and BUTTON do not have attributes that can point to URLs, and thus the resulting state changes are all categorized as hidden.
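The decision rules above can be condensed into a small routine. The following is an illustrative sketch of that logic, not the actual JAVIS implementation; the edge fields and the helper isValidUrl are assumptions made for the example.

```javascript
// isValidUrl is assumed to check for a well-formed, non-fragment URL (e.g., not '#').
function isHiddenTransition(edge) {
  var el = edge.element;
  if (el.tagName === 'A') {
    // Anchors are visible transitions only when their HREF holds a valid URL.
    return !isValidUrl(el.getAttribute('href'));
  }
  if (el.tagName === 'IMG') {
    // Images are visible only when wrapped in an anchor that has a valid URL.
    var parent = el.parentElement;
    return !(parent && parent.tagName === 'A' && isValidUrl(parent.getAttribute('href')));
  }
  // DIV, SPAN, INPUT and BUTTON cannot point to URLs: always hidden transitions.
  return true;
}

// A state is hidden only if none of its incoming edges is a valid URL-based transition.
function isHiddenState(incomingEdges) {
  return incomingEdges.every(isHiddenTransition);
}
```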
5.3 Characterization Analysis

We analyze the following characteristics:

5.3.1 Hidden-Web Quantity

Once the explored states are categorized, we annotate the hidden states on the state-flow graph to measure the amount of hidden-web data in those states. We traverse the annotated state-flow graph, starting from Index, and for each annotated hidden state we compute the differences between the previous state (which could be a visible or a hidden state) and the annotated hidden state using the Unix diff utility. To measure the amount of data that can be hidden, the differencing computes merely the additions in the target (hidden) state. For each website, JAVIS saves all the differences in a file and measures the total size in bytes.

5.3.2 Clickable Types

To investigate which clickable types (i.e., A, DIV, SPAN, IMG, INPUT and BUTTON) contribute most to hidden-web content in practice, JAVIS examines the annotated state-flow graph and gathers the edges that result in hidden states. It then calculates, for each element type, the mean of its contribution to the hidden-web percentage.

5.3.3 Correlations

Further, we measure the average DOM string size as well as the custom JavaScript code size (excluding common libraries such as jQuery, Dojo, Ext, etc.) of each website. To examine the relationship between these measurements and the client-side hidden-web content, we use R [14] to calculate the non-parametric Spearman correlation coefficients (r) as well as the p-values (p), and plot the graphs. We present the combinations that indicate a possible correlation.

Before going into detail regarding each aspect, we briefly describe how we intend to evaluate them. The Pearson correlation is a widely used measure of linear dependency between two variables. The Pearson correlation coefficient is obtained by dividing the covariance of the two variables by the product of their standard deviations. The closer it is to +1, the stronger the positive (increasing) linear relationship; the closer it is to -1, the stronger the negative (decreasing) linear relationship. As it approaches zero, there is less of a relationship (closer to uncorrelated). We should note that a high correlation does not necessarily imply a causal relationship between the variables, though it can indicate the potential existence of causal relations.

DOM Size: To find the dependence and effect of the DOM elements (Href, Div, Span, Img, Input and Button) on the percentage of invisibility, we study the statistical relationship between the DOM size as a random variable and the hidden-web rate. It should be noted that a website containing a huge number of elements, and thus a large DOM size, does not necessarily contribute to a higher invisibility rate, and vice versa. The DOM size obtained for each web application is an average over the total DOM sizes and is presented in kilobytes.

JavaScript code size: In order to find the correlation between the JavaScript code size and the hidden-web, a tool is used with the specific purpose of measuring this parameter. Although many tools exist that could aid us in this regard, we chose the Web Developer tool, an open source tool which easily provides us with the required information. With the aid of this tool, the JavaScript code size can be obtained as a compressed or uncompressed size in kilobytes, based on the user's preference; in our case we use the compressed size. Note that the JavaScript code size provided by Web Developer includes all the script files, which can consist of any JavaScript library used for that web application along with the custom JavaScript files written by the developer. Although we obtain the Pearson correlation coefficient between this aspect and the hidden-web, we should keep in mind that the result might not be remarkable. This does not mean that a small code size leads to more hidden content, or vice versa.

Categories: An advantage that ALEXA provides is the opportunity to classify the collected websites into the categories already specified by Alexa. ALEXA provides three different options for choosing top websites: (1) top websites globally, (2) top websites by country, and (3) top websites by category. Since our goal is to obtain comprehensive data, we selected the first option, which presents the universally top-ranked websites. However, in order to understand the correlation between the categories and the hidden web, we also sorted the obtained websites into the pre-defined categories. There are 17 categories in general, which can be found on Alexa's website, yet for our purposes we manually grouped them into different subjects, concluding with a total of 15 categories. In our findings, we noticed that, based on Alexa's assortment, one website can be grouped into multiple categories. It should also be noted that this categorization is not applied to the RANDOM websites.
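As a side note on the correlation measure used in this section, the rank-based (Spearman) coefficient reported in our plots can be viewed as a Pearson coefficient computed over ranks. The sketch below is a simplified illustration of that idea; it ignores tie correction and is not the R routine used in the study, and domSizes and hiddenPercentages are hypothetical per-site arrays.

```javascript
// Pearson correlation of two equally long numeric arrays.
function pearson(x, y) {
  var n = x.length;
  var mx = x.reduce(function (a, b) { return a + b; }, 0) / n;
  var my = y.reduce(function (a, b) { return a + b; }, 0) / n;
  var cov = 0, vx = 0, vy = 0;
  for (var i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    vx += (x[i] - mx) * (x[i] - mx);
    vy += (y[i] - my) * (y[i] - my);
  }
  return cov / Math.sqrt(vx * vy);
}

// Spearman's rho: replace each value by its rank, then apply Pearson on the ranks.
function ranks(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  return values.map(function (v) { return sorted.indexOf(v) + 1; });
}

function spearman(x, y) {
  return pearson(ranks(x), ranks(y));
}

// Example (hypothetical data): spearman(domSizes, hiddenPercentages)
```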
5.4 Tool Implementation

We have implemented our client-side hidden-web analysis approach in a tool called JAVIS, which is open source and available for download, along with all our empirical data, from http://salt.ece.ubc.ca/content/javis/.

JAVIS is implemented in Java and is built as a plugin for CRAWLJAX (http://crawljax.com) [25]. All our experiments were conducted on a Debian-based machine running Firefox as the embedded browser for CRAWLJAX.

Chapter 6: Results

In this chapter, the results of our empirical study are provided. We present our results in the following subsections, each corresponding to a research question as formulated in Chapter 4, Methodology.

Table 6.2 depicts a representative small sample (20 websites) of the kind of websites we have crawled and the type of data we have gathered, measured, and analyzed in this study. These websites are randomly selected from our total pool of 500 websites. The first 10 are taken from ALEXA and the second 10 from RANDOM. It should be noted that the Total column of the Hidden group in this table refers to both the hidden DOM structure and the textual content, whereas the Total Content column refers only to the hidden textual content. The Average column likewise refers to the average hidden content and DOM structure per state. The complete set of our empirical data, containing all the information in the table, is available for download (http://salt.ece.ubc.ca/content/javis/).

6.1 Pervasiveness (RQ1)

To investigate the pervasiveness of the client-side hidden-web, we crawled all 500 websites and analyzed the findings. The results are obtained individually for each website and as a whole for both sets of websites: ALEXA and RANDOM.

[Figure 6.1: Pie chart representing the percentage of websites exhibiting client-side hidden-web content from all the 500 websites: 95% (476) contain hidden-web content, 5% (24) contain no hidden-web content.]

The pie chart in Figure 6.1 depicts the percentage of websites that fully or partially exhibit client-side hidden-web content. We consider a website totally visible if it does not contain any hidden content; in other words, if none of the crawled states is hidden, the website is referred to as a visible website. Unfortunately, only 5% of the websites can be fully crawled by search engines and are thus entirely visible to the user. This fraction, in comparison to the billions of websites that exist today, is extremely small. In contrast, 95% (476/500) of the websites analyzed exhibit some degree of client-side hidden-web content. By some degree of hidden-web content, we mean that they have at least one or more client-side hidden states.

In addition, we wanted to know what percentage of the states of these 476 websites actually constitutes hidden-web content. To pursue this, we separated the websites based on the resources they were collected from and analyzed the results individually. Figure 6.2 presents three box plots illustrating the hidden-web state percentages for the two resources, ALEXA and RANDOM, and in general as the TOTAL. For more clarity and exact percentages regarding the hidden content embedded within the web applications, Table 6.1 presents the min, max, median, mean, and the 1st and 3rd quartiles of the results per resource and in total.

[Figure 6.2: Box plots of the percentage of client-side hidden-web states in ALEXA, RANDOM, and TOTAL.]

Table 6.1: Descriptive statistics of the percentage of client-side hidden-web states in ALEXA, RANDOM, and TOTAL.

Resource | Min | 1st Qua. | Median | Mean  | 3rd Qua. | Max
ALEXA    | 0   | 49       | 70     | 65.63 | 90       | 100
RANDOM   | 0   | 10       | 44     | 50.6  | 98       | 100
TOTAL    | 0   | 41       | 67     | 62.52 | 92       | 100
Table 6.2: Hidden-Web Analysis Results. The first 10 are from ALEXA, and the remaining 10 from RANDOM. (Sizes are in kilobytes; crawl time is in seconds.)

ID | Site Name | States (#) | Clickables Total | Clickables Visible | Clickables Hidden | A (Vis) | A (Hid) | Div | Span | Img (Vis) | Img (Hid) | Input | Button | JavaScript (KB) | DOM (KB) | Hidden States (%) | Hidden Total (KB) | Hidden Content (KB) | Hidden Avg. (KB) | Time (s)
1 | Google | 50 | 49 | 3 | 46 | 3 | 0 | 29 | 16 | 0 | 1 | 0 | 0 | 329 | 210 | 94 | 906 | 13 | 18 | 228
2 | ESPN | 50 | 49 | 12 | 37 | 6 | 0 | 26 | 2 | 6 | 9 | 0 | 0 | 161 | 196 | 75 | 4358 | 120 | 89 | 7565
3 | AOL | 50 | 49 | 8 | 41 | 5 | 1 | 18 | 22 | 3 | 0 | 0 | 0 | 203 | 170 | 82 | 4626 | 140 | 64 | 4727
4 | Youtube | 50 | 49 | 7 | 42 | 7 | 0 | 7 | 17 | 0 | 7 | 0 | 17 | 286 | 153 | 84 | 4230 | 153 | 86 | 530
5 | Aweber | 50 | 49 | 24 | 25 | 16 | 1 | 20 | 0 | 8 | 4 | 0 | 0 | 41 | 31 | 65 | 38 | 0 | 0.78 | 740
6 | Samsung | 50 | 49 | 3 | 46 | 2 | 0 | 42 | 3 | 1 | 0 | 0 | 1 | 96 | 267 | 92 | 1381 | 21 | 28 | 1274
7 | USPS | 50 | 49 | 8 | 41 | 5 | 1 | 33 | 7 | 3 | 0 | 0 | 0 | 200 | 258 | 82 | 563 | 6 | 11.5 | 317
8 | BBC | 50 | 49 | 41 | 8 | 25 | 0 | 3 | 3 | 16 | 2 | 0 | 0 | 142 | 112 | 16 | 293 | 6 | 6 | 794
9 | Alipay | 50 | 49 | 2 | 47 | 2 | 7 | 33 | 7 | 0 | 0 | 0 | 0 | 200 | 72 | 94 | 77 | 0 | 1.5 | 828
10 | Renren | 50 | 49 | 0 | 49 | 0 | 0 | 49 | 0 | 0 | 0 | 0 | 0 | 100 | 47 | 100 | 1613 | 3 | 33 | 152
11 | EdwardRobertson | 50 | 49 | 1 | 48 | 1 | 2 | 45 | 1 | 0 | 0 | 0 | 0 | 120 | 64 | 98 | 154 | 7 | 3.14 | 161
12 | Rayzist | 50 | 49 | 31 | 18 | 31 | 1 | 16 | 0 | 0 | 1 | 0 | 0 | 329 | 54 | 37 | 257 | 38 | 5.2 | 976
13 | Metmuseum | 50 | 49 | 3 | 46 | 3 | 0 | 2 | 0 | 0 | 44 | 0 | 0 | 54 | 87 | 94 | 935 | 68 | 19 | 364
14 | JiveDesign | 50 | 49 | 0 | 49 | 0 | 0 | 49 | 0 | 0 | 0 | 0 | 0 | 241 | 202 | 100 | 369 | 0 | 7.5 | 322
15 | MTV | 50 | 49 | 0 | 49 | 0 | 0 | 19 | 0 | 0 | 30 | 0 | 0 | 242 | 200 | 100 | 530 | 14 | 10.8 | 417
16 | Challengeair | 50 | 49 | 0 | 49 | 0 | 0 | 49 | 0 | 0 | 0 | 0 | 0 | 176 | 28 | 100 | 22 | 0 | 0.45 | 145
17 | Mouchel | 50 | 52 | 52 | 0 | 51 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 20 | 60 | 0 | 0 | 0 | 0 | 535
18 | Sacklunch | 50 | 49 | 45 | 4 | 3 | 0 | 3 | 0 | 42 | 1 | 0 | 0 | 121 | 83 | 8 | 166 | 6 | 3.39 | 236
19 | Pongo | 50 | 49 | 3 | 46 | 3 | 0 | 46 | 0 | 0 | 0 | 0 | 0 | 61 | 463 | 94 | 4229 | 58 | 83.3 | 713
20 | MuppetCentral | 50 | 49 | 17 | 32 | 8 | 0 | 32 | 0 | 9 | 0 | 0 | 0 | 254 | 224 | 65 | 4807 | 272 | 98.1 | 966

ALEXA: For the 400 websites obtained from ALEXA, on average 65.63% of the 50 states we analyzed were client-side hidden-web. This high number can perhaps be explained by the nature of such websites. These websites are among the most visited sites in the world, and the developers of many of them clearly use the latest Web 2.0 technologies, such as JavaScript, the DOM, AJAX, and HTML5, to provide high-quality features that come with rich interaction and responsiveness.

As can be seen in Table 6.1, the minimum and maximum hidden-web percentages are 0% and 100% respectively. Evidently, some websites rely only on hypertext links, whereas others rely only on newer technologies such as JavaScript to dynamically edit content; such websites are therefore either totally visible (0% hidden content) or completely hidden (100% hidden content). The first quartile, median and third quartile are 49%, 70% and 90% respectively. Compared to the RANDOM websites, the first quartile and the median are much higher, which may be due to the extensive use of new technologies to increase popularity and be rated among the top websites.

As we have discussed in Chapter 2, Background, these Web 2.0 techniques contribute enormously to the creation of the client-side hidden-web.

RANDOM: Similarly, an average of 50.6% of the states from the RANDOM websites constitute hidden states. These websites were purely randomly chosen on the Web with no previous background information about them; in other words, we had no idea about their rankings or the functionality they provide. We were expecting to witness a lower percentage here, because (1) many websites on the Web might still be classical in nature, without using any modern Web 2.0 techniques, and (2) developers may prefer to use more URL-based links for state transitions, rather than JavaScript, to avoid ending up in the hidden-web share.
We believe that 50.6% is still a high number, pointing to the pervasiveness of the client-side hidden-web on the Web.

As discussed above and seen in Table 6.1, RANDOM, in contrast to ALEXA, has a lower percentage of hidden-web in the mean, median and first quartile. However, the third quartile of RANDOM differs by 8%. Although 8% might not seem significant, it should be noted that (1) we only analyzed 100 random websites, and observing more random websites could lead to a higher percentage and a greater difference, and (2) these websites do not have any sort of ranking. One possible explanation for this difference is that if a web application uses JavaScript and modern technologies, more calls to JavaScript code seem to be made and thus the DOM is modified more extensively; however, this might not be true for all websites.

TOTAL: When the results of ALEXA and RANDOM are combined in TOTAL, we witness a total hidden-web state percentage of 62.52%. Interestingly, the average of the TOTAL is very close to the average of ALEXA. It should be noted that, on average, 25 minutes is required to crawl all 50 states of a website, while it takes 211 hours to crawl and classify all 500 websites. Crawling more websites, with a higher number of states and more types of clickable elements, could certainly lead to a higher percentage of hidden-web, but doing so would be very time consuming and expensive.

6.2 Quantity (RQ2)

In order to gain an understanding of the quantity of content in the client-side hidden-web states, we measured the amount of hidden data as described in Section 5.3.

Table 6.3 shows the minimum, mean, and maximum amounts of client-side hidden-web content for all of the crawled hidden-web states, and per hidden-web state, in two cases: (1) including both the DOM structure and the textual content, and (2) considering only the textual content. For the first case, not only do we attempt to measure the pure content, but we also consider the different elements or attributes that have been added, omitted or simply edited.

Table 6.3: Descriptive statistics of the average hidden-web content for all states and per state.

Hidden-Web | Textual (KB) Min | Textual (KB) Mean | Textual (KB) Max | All content (KB) Min | All content (KB) Mean | All content (KB) Max
Per State  | 0 | 0.60 | 11.65 | 0 | 18.91 | 286.4
All States | 0 | 27.6 | 536   | 0 | 869.7 | 13170

All hidden-web states crawled. For all the states crawled, we measured an average of 870 kilobytes of client-side hidden-web content including both DOM and textual content, while the textual content alone averages 27.6 KB. The minimum and maximum for the first case (DOM and textual content) are 0 and 13170 kilobytes, and 0 and 536 kilobytes for the latter case (textual content only).

It is visible from the table that the average size of all the hidden content is more than 30 times larger than the average of the textual content alone. This indicates that an extensive amount of the HTML structure of the website is altered in addition to the content change. It should be noted that a web application containing many hidden states does not necessarily produce a large textual difference, since a web application comprising only a few hidden states may yield the same or an even larger amount of hidden content. We simply obtain the content differences and do not compare websites against each other.

Per hidden-web state. Per hidden-web state, on average 19 kilobytes of DOM and textual content exist, of which 0.6 kilobytes is textual content only.
Similarly, the mean size of all the hidden content per state is more than 30 times larger than the size of the textual content alone. Clearly, not every state leads to the same amount of hidden content: one state may contribute only a few minor DOM and textual changes, such as appending a new child node, while another may alter (append, delete, or edit) many elements, attributes, and their text values.

We also analyzed the textual content itself to better understand what sort of content is being altered. The textual content we gathered consists of single words, numbers, short messages, or whole sentences.

The words are (1) names of countries, so that the website can be used worldwide and users can easily change the language based on their region, (2) help instructions that aid users in pursuing a task, (3) feedback or reviews about an object or a subject, and (4) names of items, especially items being sold on websites such as Amazon or eBay.

The numbers range from one to ten digits long; they act as IDs and usually appear on the selling websites discussed above. The short messages indicate limits, either of the user (for instance, how many times a song may be played) or of the website (such as its expiration time). The sentences are either statements describing a subject, such as the value of teamwork, or questions. Either way, they mainly provide news-like information on every aspect, e.g., health issues, science, animals, videos and images, actors and celebrities, sports, and so on.

6.3 Induction (RQ3)

To better understand which types of clickable elements web developers use in today's web applications to induce state changes in the browser, we analyzed how much each clickable type contributes to the measured hidden-web state percentage. As a reminder, these elements can be DIV, SPAN, INPUT, or BUTTON, along with the IMG and A tags.

The usage distribution of the different clickables can be seen in Figure 6.3. Approximately 40% of all clickables across the 500 websites are DIV elements (10327/27261), followed by the A tag with 32% (8734/27261), regardless of whether they lead to visible or hidden states. The third most used element is the IMG tag with 19%, again considering both visible and hidden states. The remaining three are SPAN, BUTTON, and INPUT with 11%, 0.3%, and 0.1%, respectively; the last two are rounded down to zero in the chart.

Figure 6.3: Pie chart displaying the use of different clickables throughout the 500 websites.

As discussed in Section 5.3, the anchor tag (A) and the image element (IMG) can induce both visible and hidden states. To recap, an anchor tag is considered an invisible element if it contributes to a new hidden state change without owning a valid URL. For the IMG tag, we assess the parent tag of the element that leads to the new state change, since we have observed that it is usually nested inside other tags. If the parent is a visible element, in other words a visible anchor tag, the IMG tag is also considered visible and the resulting state is a visible state; otherwise, the element is added to the set of invisible elements and the state change is counted as a hidden state transition. To answer the third research question (RQ3), we only consider the elements that cause hidden states in our analysis.
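The classification heuristic described above can be sketched as follows. This is simplified browser JavaScript of our own; the helper names are hypothetical and the valid-URL check is deliberately crude, not the exact logic used by our tool.

  function isValidUrl(href) {
    if (!href) return false;
    const value = href.trim().toLowerCase();
    // Empty values, bare fragments, and javascript: pseudo-URLs do not count.
    return value !== '' && value !== '#' && !value.startsWith('javascript:');
  }

  function classifyClickable(element) {
    const tag = element.tagName;
    if (tag === 'A') {
      // Anchors with a valid URL lead to visible states; anchors that change
      // the state without one are treated as invisible elements.
      return isValidUrl(element.getAttribute('href')) ? 'visible' : 'hidden';
    }
    if (tag === 'IMG') {
      // An image is visible only when wrapped in an anchor with a valid URL.
      const anchor = element.closest('a');
      return anchor && isValidUrl(anchor.getAttribute('href'))
          ? 'visible' : 'hidden';
    }
    // DIV, SPAN, INPUT, and BUTTON carry no URL of their own, so any state
    // change they trigger is reachable only through JavaScript.
    return 'hidden';
  }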
Figure 6.4: Pie chart showing hidden-web percentage behind different types of clickables. ‘A INVIS’ represents anchor tags without a (valid) URL. ‘IMG INVIS’ represents IMG elements not embedded in an anchor tag with a (valid) URL.

Figure 6.4 shows another pie chart, specifically relating the different invisible clickable types to the hidden-web state percentage. The DIV element has the highest contribution to the hidden-web state percentage (61%), followed by SPAN (16.8%). Interestingly, the IMG and A element types are also used quite often to induce client-side hidden content, with 14.7% and 6.9%, respectively. Finally, BUTTON and INPUT contribute less than one percent of the hidden-web states, with INPUT being the least used. Overall, the results show that the DIV element is, in comparison to the other elements, the most widely used clickable in practice.

6.4 Correlations (RQ4)

As part of our study, we investigated whether there are correlations between the hidden-web and other characteristics of the web applications. We considered three aspects, explained in detail below: ALEXA categories, DOM size, and JavaScript custom code size. Figure 6.5 displays the relationship between the ALEXA categories and the hidden-web state percentage, and Figures 6.6 to 6.9 are scatter plots of DOM and JavaScript size against the hidden-web state percentage and content.

Alexa Rank and Categories

For the websites obtained from ALEXA, we examined their rankings and categories (e.g., Business, Computers, Games, Health, etc.)2 to learn whether any correlation with the degree of hidden-web content exists. Figure 6.5 presents the contribution of each of the 15 Alexa categories to the hidden-web state percentage. Websites in the Computers and Regional categories seem to contain the most client-side hidden-web content, websites in the Kids/Teens category the least, and the remaining categories fall in the range between 5 and 30%. Note that each website can be a member of multiple categories on Alexa, so the distribution shown in Figure 6.5 is merely an indication. We did not observe any noticeable correlation with Alexa rank.

Figure 6.5: Bar chart of Alexa categories (Art, Bus, Comp, Game, Hlth, Hme, Kid/Teen, Nws, Rec, Ref, Reg, Sci, Shp, Sct, Sprt) versus hidden-web state percentage.

2 http://www.alexa.com/topsites/category

DOM Size and Hidden-Web State Percentage

We analyzed the DOM size of the web applications for ALEXA, for RANDOM, and as a TOTAL. As can be seen in Table 6.4, the minimum, maximum, and average DOM size of the ALEXA websites are 11, 826, and 146.2 kilobytes, respectively; for the websites obtained from RANDOM these values are lower, at 7, 426, and 90.6 kilobytes. The ALEXA websites are collected based on popularity and ranking and thus tend to have larger DOMs, whereas the RANDOM websites are ordinary sites with smaller DOMs.

We also conducted a correlation analysis between the hidden-web state percentage and the average DOM size, taken over all the crawled states. Figure 6.6 depicts a scatter plot of the DOM size against the hidden-web state percentage; the two have only a weak correlation (r = 0.4). A website that contributes many hidden states does not necessarily have an enormous DOM tree. For example, one website with a 57% hidden-web state percentage has the largest DOM size (more than 800 KB), whereas the DOM sizes of websites with a higher hidden-web percentage (90%) are below 210 KB.
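In Figures 6.6 through 6.9, r denotes the Spearman rank correlation coefficient and p its associated p-value, as also noted in the figure captions. For reference, in the absence of tied ranks the coefficient can be written as

  r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n(n^{2} - 1)},

where n is the number of websites and d_i is the difference between the ranks of website i on the two variables being compared (for instance, average DOM size and hidden-web state percentage).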
Table 6.4: Descriptive statistics of the DOM size in ALEXA, RANDOM, and TOTAL.

              DOM size (KB)
Resources     Min      Mean      Max
ALEXA          11      146.2     826
RANDOM          7      90.62     420
TOTAL           7      118.41    826

Table 6.5: Descriptive statistics of the JavaScript custom code size in ALEXA, RANDOM, and TOTAL.

              JavaScript size (KB)
Resources     Min      Mean      Max
ALEXA           1      116.44    586
RANDOM          0      57.26     417
TOTAL           0      86.85     586

DOM Size and Hidden-Web Content

We also analyzed the correlation between the amount of hidden-web content and the average DOM size, again considering all the crawled states. Figure 6.7 shows a strong correlation (r = 0.65) between the DOM size and the amount of hidden-web content: around 80% of the websites with less than 200 KB of DOM size contribute less than 500 KB of hidden content, and as the DOM size grows larger, the hidden content expands with it.

Figure 6.6: Scatter plots of the DOM size versus hidden-web state percentage (r = 0.4, p = 0). ‘r’ represents the Spearman correlation coefficient and ‘p’ is the p-value.

The correlation between these two parameters comes as no surprise: the larger the DOM tree, the more content a website holds. However, for a web application with a large DOM tree it is not clear which type of content makes up the larger share, the hidden or the visible content.

JavaScript Size and Hidden-Web State Percentage

We first analyzed the JavaScript custom code size of the web applications for ALEXA, for RANDOM, and as a TOTAL. As can be seen in Table 6.5, the minimum, average, and maximum JavaScript code size of the ALEXA websites are 1, 116.44, and 586 kilobytes, respectively. While some web applications use JavaScript heavily, others contain zero or very little JavaScript code, as can be seen among the RANDOM web applications: some prefer to generate dynamic content with JavaScript, while other developers provide simple web applications without it.

Figure 6.7: Scatter plots of the DOM size versus hidden-web content (r = 0.65, p = 0). ‘r’ represents the Spearman correlation coefficient and ‘p’ is the p-value.

In Figure 6.8, around 80% of the websites have less than 100 KB of JavaScript code, while the hidden-web percentage of those websites ranges from 0 to 100%; the JavaScript code size of the remaining 20% varies between 100 and 586 KB. Although we assumed that a high client-side hidden-web state percentage would come with a large JavaScript code size, Figure 6.8 rejects this assumption by showing only a weak correlation between the two parameters. A web application does not fundamentally need much JavaScript code to induce hidden states, as the sketch below illustrates.
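As an illustration, in the spirit of the example discussed in Chapter 2, a single small handler can be attached to many elements; each click fetches a small HTML delta from the server and injects it into the DOM, producing a new hidden state without any growth in the amount of JavaScript code. The element class, the endpoint, and the container id below are hypothetical.

  // One tiny handler, registered on many clickable DIVs.
  document.querySelectorAll('div.menu-item').forEach(function (item) {
    item.addEventListener('click', async function () {
      // Retrieve a small HTML delta from the server (hypothetical endpoint).
      const response = await fetch('/fragment?id=' + encodeURIComponent(item.id));
      const delta = await response.text();
      // Inject it into the page: a new DOM state that no URL points to.
      document.getElementById('content').innerHTML = delta;
    });
  });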
One JavaScript function that modifies the DOM tree can simply be executed many times by different elements, and each execution can produce yet another hidden state.

Figure 6.8: Scatter plots of the JavaScript size versus hidden-web state percentage (r = 0.32, p = 0). ‘r’ represents the Spearman correlation coefficient and ‘p’ is the p-value.

JavaScript Size and Hidden-Web Content

Figure 6.9 likewise shows only a weak monotonic correlation between the JavaScript code size and the hidden-web content. We expected a stronger correlation because, after all, JavaScript code is the root cause of client-side hidden-web content. One (less likely) reason for the low correlation could be our exclusion of the popular JavaScript libraries used within the web applications. The weak correlation can, however, also be explained from another perspective.

Figure 6.9: Scatter plots of the JavaScript size versus hidden-web content (r = 0.29, p = 0). ‘r’ represents the Spearman correlation coefficient and ‘p’ is the p-value.

As mentioned before, many websites use JavaScript today, yet they do not necessarily need a lot of JavaScript code to modify the DOM tree and contribute more hidden content. This behaviour matches the simple example used in Chapter 2, Background and Motivation: Figure 2.1 is a piece of JavaScript code that causes many hidden-web states, and thus increases the hidden-web content, although the amount of code is relatively small. In that example, all the state updates are retrieved as small HTML deltas from the server and injected into the DOM tree through a small piece of JavaScript code. In fact, we have witnessed this kind of behaviour in many of the examined websites with client-side hidden-web characteristics. Although these results give an idea of how JavaScript and DOM size relate to the hidden-web content and states, more investigation is required; examining more websites and inspecting their JavaScript and DOM may reveal more significant correlations.

Chapter 7

Discussion

In the previous chapters, the approach proposed for this study, the hidden-web analysis technique, and the evaluation on 500 websites were explained. From the results, we now have insight into how the hidden-web is induced by client-side scripting languages such as JavaScript. As with other studies, however, ours is not fully free of risk: some aspects of the study may be seen as validity threats that can influence the evaluation of our results and affect our conclusions. These aspects are discussed in the following sections.

7.1 Threats to Validity

We consider four types of validity threats: internal validity, external validity, construct validity, and conclusion validity. We discuss each of them in the following subsections.

7.1.1 Internal Validity

Threats to internal validity concern issues in how the subjects are selected and treated during the experiment. One of the factors that impacts internal validity in our study is sampling.
If the websites had been selected manually by the author, the final results would have been biased and questionable. Thus, to reduce sampling bias, we selected websites from ALEXA and RANDOM: ALEXA provides websites that are already ranked worldwide, and RANDOM provides websites chosen at random, without our involvement.

In addition, the 400 websites gathered from ALEXA are selected based on popularity, and the results may not hold for websites that are not popular; we therefore also analyzed 100 unranked websites gathered from RANDOM.

Another internal threat is sequentially selecting and crawling the clickable elements from the candidate elements list. Although this may not seem problematic, it does affect the hidden-web percentage. To mitigate this threat, the list of candidate elements is shuffled before clicking; in other words, we randomize the candidate clickable selection while crawling, to make the state exploration of each website unbiased.

One way to further mitigate sampling threats is to select samples along different dimensions, such as the countries in which the websites are hosted, the categories they represent, and their popularity.

7.1.2 External Validity

External validity refers to the generalizability of the study: whether the final results can be generalized outside its scope. To obtain generalizable results, we analyzed not only websites obtained from ALEXA but also websites gathered from another source, RANDOM. In terms of representativeness, we collected data for a sample of 500 websites, 400 of which come from ALEXA and are ranked among the most popular websites in the world. Millions of web applications exist today, and we examined only a fraction of them; the limited number of websites in our evaluation (500) is therefore another external threat.

In addition, other variables used in the setup of CRAWLJAX can affect the evaluation. Each of these variables is discussed separately below:

Clickable Types: Through JavaScript event-driven programming, any HTML element can potentially become a clickable item. In this study we include six of the most common HTML elements used as clickables, selected based on a small pilot study we conducted on ten Alexa websites. Other clickable types (e.g., P, TD) could also induce client-side hidden-web content, which we have not analyzed; including them would probably increase the hidden-web percentage marginally.

Event Types: Our study is constrained to the click event type. We believe this is the most commonly used event type in practice for making event-driven transitions in Web 2.0 applications. However, the DOM event model has many other event types, e.g., mouseover and drag and drop, which can potentially lead to hidden-web states.

In-code URLs: We assume that if a transition is made through JavaScript, it is hidden. However, large search engines such as Google parse a website's static JavaScript code in search of valid URLs to crawl (a sketch of such in-code URL extraction follows this list). In our study, we do not explicitly take valid URLs in the code into account.

Number of states examined: To allow a fair analysis in a reasonable amount of time, we constrained the maximum number of states to crawl for each website to fifty. A few websites did not have that many states to crawl; in those cases, we analyzed the websites according to the number of crawled states. Choosing a different maximum could theoretically impact our evaluation results, although we have no evidence that this is the case (because of the randomization).
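The sketch below is ours and purely illustrative; it does not reflect how Google or any particular search engine actually processes JavaScript. It shows how URL-like string literals can be extracted from static JavaScript source without executing it; transitions driven by such in-code URLs could arguably be classified as visible, which is why ignoring them is a potential threat.

  // Scan JavaScript source for URL-like string literals (rough heuristic).
  function extractInCodeUrls(jsSource) {
    const pattern = /["'](https?:\/\/[^"']+|\/[^"'\s]+\.html?)["']/g;
    const urls = [];
    let match;
    while ((match = pattern.exec(jsSource)) !== null) {
      urls.push(match[1]);
    }
    return urls;
  }

  // Both URLs below are discoverable without executing the code.
  console.log(extractInCodeUrls(
      "function go() { window.location = '/news/today.html'; }" +
      "fetch('http://example.com/api/items');"));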
To alleviate the generalizability issue, we can analyze more websites along with more states. Although increasing the maximum number of states and the crawl depth may not necessarily yield more hidden-web content, it is required to reach more generalizable conclusions.

7.1.3 Construct Validity

As mentioned before, our aim is to measure how often content ends up in the hidden-web portion due to the client-side scripting used by web developers. In step 2 of Figure 5.1, the states in the state-flow graph are classified into two categories: visible-web states, reached through hypertext links, and hidden-web states, reached through non-hypertext links. There may, however, be cases where a hidden state is introduced by external parties such as advertisers or adversaries. Many advertisement companies today embed their own JavaScript code into web applications, with or without permission, and the crawler, without knowing where a clickable element originates from, clicks on it; if a state transition occurs, a new state is added to the state-flow graph. To mitigate this threat to construct validity, all advertisements and external code unrelated to the web application itself should be removed; although necessary, this is a sensitive task that requires manual inspection.

7.1.4 Conclusion Validity

Conclusion validity is the degree to which the conclusions we draw about the relationships in our data are reasonable. Based on the observed frequency of client-side scripting, specifically JavaScript, we conclude that a similar frequency can be anticipated for the entire web, since the web is constructed by web developers. A threat to this conclusion is that we did not take automated code generation tools into consideration; accounting for such other sources of web construction would reduce this threat. To support reproducibility, our JAVIS tool is open source and available for download, together with all the empirical data, so that our study can be fully replicated.

7.2 Implications

Our study shows that a considerable amount of data is hidden due to client-side scripting. Many of today's advertisements also produce dynamic content, which can lead to hidden content as well. The hidden content is growing rapidly as more developers adopt modern Web 2.0 techniques to implement their web applications.

We believe more research is needed to support better understanding, analysis, crawling, indexing, and searching of this new type of hidden-web content.
In addition, web developers need to realize that by using modern techniques (e.g., JavaScript, AJAX, HTML5), a large portion of their content becomes hidden, and thus unsearchable, for their potential users on the web.

Chapter 8

Conclusions and Future Work

With the advent of Web 2.0 technologies, an increasing amount of the web application state is being offloaded to the client-side browser to improve responsiveness and user interaction. Through the execution of JavaScript code in the browser, the DOM tree representing a webpage at runtime is incrementally mutated without requiring a URL change. This dynamically updated content is inaccessible to general search engines and, as a result, becomes part of the hidden-web portion of the Web. We present the first empirical study on measuring and characterizing the hidden-web induced as a result of client-side scripting.

Our study shows that client-side hidden-web content is omnipresent on the web. Of the 500 websites we analyzed, 476 (95%) contained some degree of hidden-web content. In those websites, on average 62% of the states were hidden, and per hidden state we measured an average of 19 kilobytes of hidden content, of which 0.6 kilobytes were textual content. The DIV element is the most commonly used clickable to induce client-side hidden-web content, followed by the SPAN element. This points to the importance of examining such elements in modern crawling engines and going beyond link analysis of anchor tags.

As future work, our goal is to complement this study to gain more insight into the size and growth of the hidden-web content. To achieve this goal, we consider a few enhancements, discussed separately below.

One improvement is to expand the set of experimental objects so that the final results become more generalizable: instead of considering only 100 RANDOM websites, we will increase this number to 400 random websites generated automatically using a single script.

Another modification is to increase the maximum number of states examined per website. As discussed in the previous chapters, we set the maximum number of states to 50; in the future we will set this to 70 states. Although we are not certain that this will increase the measured hidden-web content, we wish to analyze the results under this setting.

Another extension is to enhance our tool to automatically obtain the custom JavaScript code within the web application. To pursue this, we will attempt to use a proxy to gain access to the custom JavaScript code within the web page and record all the existing interactions, especially the custom JavaScript code.

Bibliography

[1] Alexa top sites. http://www.alexa.com/topsites/. → pages 1, 16

[2] M. Alvarez, A. Pan, J. Raposo, and A. Vina. Client-side deep web data extraction. In Proc. of the Int. Conf. on E-Commerce Technology for Dynamic E-Business, pages 158–161. IEEE Computer Society, 2004. → pages 13

[3] L. Barbosa and J. Freire. An adaptive crawler for locating hidden-web entry points. In Proc. of the 16th Int. Conf. on World Wide Web (WWW), pages 441–450. ACM, 2007. ISBN 978-1-59593-654-7. doi:http://doi.acm.org/10.1145/1242572.1242632. → pages 1, 7, 12

[4] Z. Behfarshad and A. Mesbah. Hidden-web induced by client-side scripting: An empirical study. In Proceedings of the International Conference on Web Engineering (ICWE), volume 7977 of Lecture Notes in Computer Science, pages 52–67. Springer, 2013. → pages iii, 3
[5] M. Bergman. White paper: the deep web: surfacing hidden value. Journal of Electronic Publishing, 7(1), 2001. → pages 1, 4, 5, 13

[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, 1998. ISSN 0169-7552. doi:http://dx.doi.org/10.1016/S0169-7552(98)00110-X. → pages 12

[7] M. Burner. Crawling towards eternity: Building an archive of the world wide web. Web Techniques Magazine, 2(5):37–40, 1997. → pages 12

[8] K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the web: observations and implications. SIGMOD Rec., 33(3):61–70, 2004. → pages 13

[9] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1-7):161–172, 1998. ISSN 0169-7552. → pages 12

[10] S. R. Choudhary, H. Versee, and A. Orso. WebDiff: Automated identification of cross-browser issues in web applications. In Proc. of the 26th IEEE Int. Conf. on Softw. Maintenance (ICSM'10), pages 1–10, 2010. → pages 17

[11] A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A. Tomkins. The discoverability of the web. In Proc. of the Int. Conf. on World Wide Web (WWW), pages 421–430. ACM, 2007. ISBN 978-1-59593-654-7. doi:http://doi.acm.org/10.1145/1242572.1242630. → pages 7, 12

[12] A. F. de Carvalho and F. S. Silva. Smartcrawl: a new strategy for the exploration of the hidden web. In Proc. of the ACM Int. Workshop on Web Information and Data Management, pages 9–15. ACM, 2004. ISBN 1-58113-978-0. doi:http://doi.acm.org/10.1145/1031453.1031457. → pages 7, 12, 13

[13] C. Duda, G. Frey, D. Kossmann, R. Matter, and C. Zhou. Ajax crawl: making Ajax applications searchable. In Proc. Int. Conf. on Data Engineering (ICDE'09), pages 78–89, 2009. → pages 13

[14] R. Gentleman and R. Ihaka. The R project for statistical computing. http://www.r-project.org. → pages 23

[15] B. He, M. Patel, Z. Zhang, and K. Chang. Accessing the deep web. Communications of the ACM, 50(5):94–101, 2007. → pages 1, 13, 20

[16] A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219–229, 1999. ISSN 1386-145X. → pages 12

[17] W. Hsieh, J. Madhavan, and R. Pike. Data management projects at Google. In Proc. of the Int. Conf. on Management of Data (SIGMOD), pages 725–726, 2006. → pages 1

[18] R. Khare, Y. An, and I.-Y. Song. Understanding deep web search interfaces: a survey. SIGMOD Rec., 39(1):33–40, 2010. → pages 13

[19] B. Krishnamurthy and C. Wills. Cat and mouse: content delivery tradeoffs in web access. In Proc. of WWW, pages 337–346. ACM, 2006. → pages 16

[20] J. P. Lage, A. S. da Silva, P. B. Golgher, and A. H. F. Laender. Automatic generation of agents for collecting hidden web pages for data extraction. Data Knowl. Eng., 49(2):177–196, 2004. ISSN 0169-023X. doi:http://dx.doi.org/10.1016/j.datak.2003.10.003. → pages 1, 7, 12, 13

[21] P. Liakos and A. Ntoulas. Topic-sensitive hidden-web crawling. Proc. of the Int. Conf. on Web Information Systems Engineering (WISE), pages 538–551, 2012. → pages 13

[22] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy. Google's deep web crawl. Proc. VLDB Endow., 1(2):1241–1252, 2008. doi:http://doi.acm.org/10.1145/1454159.1454163. → pages 7, 12

[23] A. Mesbah and S. Mirshokraie. Automated analysis of CSS rules to support style maintenance. In Proc. of the 34th ACM/IEEE Int. Conf. on Softw. Eng. (ICSE), pages 408–418. IEEE Computer Society, 2012. → pages 17
[24] A. Mesbah, E. Bozdag, and A. van Deursen. Crawling Ajax by inferring user interface state changes. In Proceedings of the International Conference on Web Engineering (ICWE), pages 122–134. IEEE Computer Society, 2008. → pages 13

[25] A. Mesbah, A. van Deursen, and S. Lenselink. Crawling Ajax-based web applications through dynamic analysis of user interface state changes. ACM Transactions on the Web (TWEB), 6(1):3:1–3:30, 2012. → pages 13, 18, 19, 20, 25

[26] A. Mesbah, A. van Deursen, and D. Roest. Invariant-based automatic testing of modern web applications. IEEE Trans. on Softw. Eng. (TSE), 38(1):35–53, 2012. → pages 20

[27] A. Ntoulas, P. Zerfos, and J. Cho. Downloading textual hidden web content through keyword queries. In Proc. of the ACM/IEEE-CS Joint Conference on Digital Libraries, pages 100–109. ACM, 2005. ISBN 1-58113-876-8. doi:http://doi.acm.org/10.1145/1065385.1065407. → pages 12

[28] B. Pinkerton. Finding what people want: Experiences with the web crawler. In Proc. of the Int. World Wide Web Conf. (WWW), volume 94, pages 17–20, 1994. → pages 12

[29] S. Raghavan and H. Garcia-Molina. Crawling the hidden web. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB), pages 129–138, 2001. ISBN 1-55860-804-4. → pages 1, 5, 7, 12

[30] W. Wu, A. Doan, C. Yu, and W. Meng. Modeling and extracting deep-web query interfaces. In Advances in Information and Intelligent Systems, volume 251, pages 65–90. Springer, 2009. → pages 7, 12

[31] C. Yue and H. Wang. Characterizing insecure JavaScript practices on the web. In Proc. of the Int. World Wide Web Conf. (WWW), pages 961–970. ACM, 2009. ISBN 978-1-60558-487-4. doi:http://doi.acm.org/10.1145/1526709.1526838. → pages 1, 16
