
UBC Theses and Dissertations


Accelerating web search using GPUs Tadros, Rimon 2015


Full Text


Accelerating Web Search using GPUs

by

Rimon Tadros

PGD., Alexandria University, 2007
B.A.Sc., Alexandria University, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2015

© Rimon Tadros 2015

Abstract

The amount of content on the Internet is growing rapidly, as is the number of online Internet users. As a consequence, web search engines must continually increase their computing capability and data while maintaining low search latency and avoiding a significant rise in cost per query.

To serve this larger number of online users, web search engines rely on a large distributed system in their data centers. They partition their data across hundreds of thousands of independent commodity servers called Index Serving Nodes (ISNs). These ISNs work together in a distributed manner to serve search queries as a single coherent system. A large number of commodity servers is chosen over a smaller number of supercomputers because of the need for scalability, high availability/reliability, performance, and cost efficiency.

To serve more data, a web search engine can be scaled either vertically or horizontally [21]. Vertical scaling enables ranking more documents per query within a single node by employing machines with higher single-thread and throughput performance and with bigger, faster memory. Horizontal scaling supports a larger index by adding more index serving nodes, at the cost of increased synchronization, aggregation overhead, and power consumption.

This thesis evaluates the potential for achieving better vertical scaling by using graphics processing units (GPUs) to reduce the document ranking latency per query at a reasonable initial cost increase.
It achieves this by exploiting the parallelism in ranking the numerous potential documents that match a query, offloading that work to the GPU. We evaluate this approach using hundreds of rankers from a commercial web search engine on real production data. Our results show an 8.8× harmonic-mean reduction in latency and a 2× improvement in power efficiency when ranking 10,000 web documents per query for a variety of rankers, using C++AMP and a commodity GPU.

Preface

This dissertation is original, unpublished, independent work by the author, Rimon Tadros.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgements
1 Introduction
  1.1 Contributions
  1.2 Thesis Organization
2 Background
  2.1 Review of Modern Web Search Engines
    2.1.1 Web Search Engine Components
    2.1.2 Web Search Engine Indexes
    2.1.3 Query Processing and Index Serving
    2.1.4 Document Ranking Overview
  2.2 GPUs
  2.3 Summary
3 Web Search on GPUs
  3.1 Determining What to Offload to the GPU
  3.2 GPU Design Overview
  3.3 Feature Extraction on GPU
    3.3.1 Dependent Feature Extraction
  3.4 Model Input Transformation
  3.5 Model Evaluation
    3.5.1 Neural Network Evaluation
  3.6 GPU Models Memory Management
  3.7 Software Deployment and Failure Handling
  3.8 Summary
4 Experimental Methodology
5 Experimental Results
  5.1 Power Consumption
6 Related Work
7 Conclusion
  7.1 Future Work
Bibliography

List of Tables

3.1 MIT param1, param2, and FeatureIndex meaning per transformation type.
4.1 Hardware specs.
4.2 R-S-BLG, R-M-BGD, R-L-D ranker details.

List of Figures

2.1 Overview of web search engine.
2.2 Overview of index serve.
2.3 Query processing main steps.
2.4 Document ranking steps. The yellow boxes are the GPU modules.
2.5 Dynamic feature extraction FSM.
2.6 Simple three-layer decision tree.
2.7 Simple two-layer neural net.
3.1 GPU implementation overview.
3.2 Example FE request.
3.3 GPU FE serial request format.
3.4 GPU FE request format, parallel streams.
3.5 Model input transformation types.
3.6 Model input transformation structure.
3.7 TreeNode structure.
3.8 MIT unoptimized.
3.9 Model inputs reference features in the feature vector using the feature map.
3.10 MIT optimized.
3.11 Modeling neural network evaluation as matrix multiplication for a single document.
3.12 Modeling neural network evaluation as matrix multiplication for multiple documents.
3.13 Neural net evaluation of one layer for multiple documents.
5.1 R-S-BLG execution time on the CPU and the GPU.
5.2 Details of R-S-BLG normalized execution on the GPU.
5.3 R-M-BGD execution time on the CPU and the GPU.
5.4 Details of R-M-BGD normalized execution on the GPU.
5.5 R-L-D execution time on the CPU and the GPU.
5.6 Details of R-L-D normalized execution on the GPU.
5.7 Average power consumption per ranker type.

Glossary

FE: Feature Extraction
FSM: Finite State Machine
FTP: File Transfer Protocol
GPGPU: General-Purpose computing on Graphics Processing Units
GPU: Graphics Processing Unit
HLSL: High-Level Shader Language
HTML: HyperText Markup Language
HTTP: Hypertext Transfer Protocol
IDF: Inverse Document Frequency
IR: Information Retrieval
ISN: Index Serving Node
ISR: Index Reader
L1 Ranking: Level 1 ranking, also called selection
L2 Ranking: Level 2 ranking, also called ranking
MIE: Model Input Evaluation (FE and MIT)
MIT: Model Input Transformation
ME: Model Evaluation
QPS: Queries per second
SE: Search Engine
SERPs: Search Engine Result Pages
SLA: Service Level Agreement
UI: User Interface
URL: Uniform Resource Locator
WSQ: Web Search Query
WWW: World Wide Web

Acknowledgements

I would like to thank my supervisor, Professor Tor Aamodt, for all the help in this great learning journey. I also thank my managers and peers at Microsoft, who supported me in finishing this work, and the other members of the computer architecture group for their support. Finally, I would like to thank my family for everything they have helped me with throughout my life.

Chapter 1

Introduction

Web search engines are systems that help users quickly find relevant web pages on the Internet. The user enters a few keywords, and the search engine returns a page containing the result as a list of links to relevant pages on the Internet. The resulting pages are deliberately not similar to one another; returning relevant pages from multiple perspectives is called "diversification" [1].

Web search engines are crucial to Internet users. Commonly, users reach web pages in one of the following ways: using web search engines, following links from page to page, or typing the URL in the browser. [34] shows that web search engines influence 13.6% of users' web traffic and help users reach 20% more sites.
Other surveys show that search affects 50-80 percent of the ways users find a new web page.

Web search is a large and competitive business, and to attract more users, search engines must continuously expand their compute capability and data to match the increasing size of the Internet and the growing number of users. The size of the Internet is growing exponentially [14, 16, 24, 43]. There are many techniques to measure the size of the Internet, including Monte Carlo and importance sampling [43]. The number of Internet users is increasing linearly and has reached 3 billion [40]. The rise in the amount of content on the Internet and the growth in the number of users are challenging for web search engines. This challenge has led researchers and practitioners to look at software and hardware approaches to improve system scalability.

The core data of a web search engine is its index. A web search index can be simplified as a record of where words appear on the Internet's web pages. In reality, there are several types of indexes, and they contain much more information about the web pages they index. The index size, measured by the number of web pages indexed, is crucial to a good user experience.

The Internet is large, and any commercial web search engine index will always contain only a subset of the information on it. The size of the major web search engine indexes is not public information; however, efforts such as [6] estimate that Google's index contains around 50 billion pages while Bing's index contains around 15 billion pages. The larger the web search index, the more information it has that can be used to satisfy users' queries, which leads to a better user experience. The index needs to be updated periodically to match changes on the Internet. High index freshness and growing index volume are significant challenges.
Expanding the index is challenging in both stages: index building (also called indexing) and index serving (serving users' queries from the index).

Creating and growing the index requires crawling, indexing, compression, and storage. Crawling, also called web spidering, is following URL links from one page to other pages and downloading those pages. Indexing is recording which words appear in the crawled pages. Commercial web search engines use distributed index building systems that run on multiple servers simultaneously. There are many challenges in crawling, including spam detection, content quality measurement, duplicate removal, network optimization, and scalability. Scalability in these steps is challenging. Still, they are offline processes: they happen in the background without direct interaction with the end user, so index building has more relaxed restrictions. For this reason, we decided to look into the scalability of the most challenging problem, index serving.

Serving live user queries is an online process. Query processing that serves a bigger index is challenging in many ways, such as index reading, decompression, caching, document selection, and document ranking. Index reading and caching are I/O-bound processes. Document selection is the process of finding web pages that contain the user's query words, while document ranking ranks these selected pages so that the highest-ranked documents can be presented to the user. The higher a document's rank, the higher its relevance to the user's query. These steps (index reading, decompression, caching, document selection, and ranking) must happen within a very restricted time budget, or SLA (service level agreement).

Users expect to get their search results in less than a second. Studies show that users who experience a delay in the response of any website get frustrated and tend to leave and ignore that site [5, 9].
Therefore, with this fixed latency budget per query, any improvement in one part of query processing makes more time available for the other parts and reduces overall query latency. Lower query latency leads to happier users. Speeding up query processing also enables searching through a bigger index per node, and searching a larger index quickly increases search quality, which likewise leads to satisfied users.

Improving scalability by employing more powerful servers or by increasing the number of servers always comes at the cost of more power consumption, and power consumption incurs the majority of data center cost. Data center energy consumption is growing: some studies [17] suggest that data center electricity use grew 50% from 2005 to 2010. Recent research [15] explored using lower-power cores to reduce the cost of query processing. While energy was reduced, this came at the expense of decreased search result quality, since fewer documents could be ranked within the fixed latency budget. There is a need for efficient hardware that can deliver better query/$. Researchers have looked at using custom fabric that delivers a more efficient solution than the general-purpose CPU [33], but it comes at the cost of development effort and code maintainability. In this thesis, we consider accelerating document selection and ranking on GPUs using C++AMP, a programming language similar to C++. C++AMP has been shown to increase development productivity by providing elegant ways to express kernel code and to synchronize data between CPU and GPU memory. Kernel code is written as a lambda expression inside parallel_for_each with the restrict(amp) keyword, which allows the developer to write code efficiently.
Moreover, data copies between CPU and GPU memory are handled by the array and array_view constructs, which make data transfer transparent to the developer. We also use well-established development and debugging tools, such as the Microsoft Visual Studio IDE and development stack. Using the GPU enables ranking a higher number of documents per query within the fixed latency budget at a better cost.

Graphics processing units (GPUs) such as NVIDIA's Kepler [29] have been shown to increase throughput by up to an order of magnitude versus well-optimized multicore CPU code [19]. This increase in throughput is possible because GPUs dedicate more chip area to computation rather than to complex instruction scheduling hardware and caches.

1.1 Contributions

Below is a summary of our contributions:

• Evaluation of the use of GPUs for a data center workload. We examined the use of GPUs for web search engine query processing. Figure 2.4 shows the document ranking steps and the modules that we accelerated. We improved the fast first-pass rankers of the selection phase as well as the second-stage expensive rankers of the ranking phase. These improvements raise the overall performance of document ranking.

• One of the first evaluations of large-scale software on a GPU using C++AMP.

1.2 Thesis Organization

The rest of this thesis is organized as follows:

• Chapter 2 provides a summary of the web search architecture whose index size we aim to increase by using GPUs.
Also, it presents a summary of GPUs and their programming model.

• Chapter 3 describes the changes we made to our web search engine to offload part of the query processing onto a GPU using C++AMP.
• Chapter 4 describes our evaluation methodology.
• Chapter 5 provides our experimental results and analysis.
• Chapter 6 summarizes related work.
• Chapter 7 concludes.

Chapter 2

Background

In this chapter, we give background on web search engines and GPUs.

2.1 Review of Modern Web Search Engines

Web search engines are systems that index the Internet's content (called documents) to make it searchable; examples are Google and Bing. They allow users to search these indexes to find web pages on the Internet using a simple HTML UI. This web UI typically contains a search box in which users type keywords, and a result page containing around 10 to 20 links to pages on the Internet. These links are called the search engine's "blue links".

2.1.1 Web Search Engine Components

Figure 2.1 shows the main subsystems of a web search engine, the index builder and the index server, and how they relate. The index builder contains the crawler [13]. A crawler is software that visits pages on the Internet, downloads their content, and then follows the links on those pages to find more pages. Crawlers face many challenges, including memory size, network optimization, and spam. There is an immense number of spam web pages on the Internet; spam can generate millions of dollars if spammers get users' clicks. Spammers artificially generate spam web pages and make them crawler friendly, so the crawler has to detect spam pages and add metadata to the index to differentiate them [27]. After downloading a web page, the index builder adds, updates, or deletes that page's content in the index. This process of modifying the index is called index merge, and it happens with minimal interruption to the index serving process.

[Figure 2.1: Overview of web search engine.]

2.1.2 Web Search Engine Indexes

There are two main types of indexes: inverted indexes and forward indexes [20]. The inverted index is a map from words to web pages: given a word, the inverted index outputs a list of web pages that contain that word. In an inverted index, each web page on the Internet is assigned a unique identifier called a document identifier, or DocId. As an example, the inverted index entries for the two words "Lazy Dog" might be:

• Lazy: 10, 12, 18, ...
• Dog: 2, 8, 12, 15, 18, ...

This means the word "Lazy" appears in documents 10, 12, and 18, while the word "Dog" appears in documents 2, 8, 12, 15, and 18. The document selection phase of index serving uses the inverted index to quickly find the pages that contain a user's query words. Finding the web pages that contain multiple keywords using the inverted index is called list intersection. The inverted index is compressed and stored on disk, and part of it gets cached in memory [44].

The forward index contains a map from pages to words, where each word has an identifier, or WordId. An example of the forward index for DocIds 10 and 20:

• DocId 10: 2, 5, 7, ...
• DocId 20: 7, 10, 7, ...

This means document 10 contains word ids 2, 5, 7, and so on, and document 20 contains word ids 7, 10, 7, and so on. The forward index is used for feature extraction in the document ranking phase of index serving, as we will see in the following sections.

These indexes are represented by bitmaps. They are compressed and stored on disk.
At run time, parts of these indexes are loaded from disk, decompressed, and stored in memory.

2.1.3 Query Processing and Index Serving

Users' search queries are handled by the index serving part of the web search engine. Figure 2.2 shows the targeted end-to-end index serving workflow [4, 8, 15, 45]. A user enters the search query in the web browser, which sends an HTTP request to the search engine. The query is routed to a particular data center by the load balancer and arrives at the front end, where it is checked against a recent results cache. If the query is in the cache (a cache hit), the query's answer (the search result) also exists in the cache, and the index server returns the results to the user immediately. This cache reduces search query cost because each hit brings the result to the user without performing an expensive index search. The results cache includes many optimizations that determine when to store a new query and when to evict existing queries. On a cache miss, when the search query's answer is not in the cache, the query request is passed on to the back end. The back end has a result aggregator, which initiates a search request to a set of index serving nodes (ISNs). Together, this set of machines contains a copy of one full index, and each machine works independently to search its portion of the index. On each ISN, the request is first sent to the document selection phase, which uses the inverted index to select the candidate documents and forwards the ranking request to the ranking stage.

Ideally, all the index serving machines reply to the aggregator with a result (a set of documents) and a corresponding score per document. However, in some cases some of the ISNs fail and the query is not completed.

[Figure 2.2: Overview of index serve.]

In this case, the partial result is sent back to the user, but it does not get cached. The higher the score, the more relevant the document. ISNs must reply to the aggregator within a certain duration called the SLA ("service level agreement"). This rigid time restriction makes some ISNs terminate the search before ranking all pages in their partial index, which is known as early termination [46]. Early termination may affect search quality. The aggregator merges and sorts the results coming from the multiple ISNs, adds captions (the text that appears under each result's web link), and forwards the result to the front end. The front end caches the result if the search was successful and sends it back to the user.

Scaling in these distributed architectures happens in two ways: vertically and horizontally. Vertical scaling means obtaining more powerful/faster machines with more CPUs/cores, more memory, and a faster disk or SSD. The problem with this method of scaling is that multicore processor performance growth is not keeping up with the exponential growth of the Internet. Horizontal scaling means adding more index serving machines to the distributed system. Ignoring the aggregation overhead, the problem here is the increase in per-query cost as the amount of physical hardware in the data centers grows. The increase is twofold: first the initial hardware cost and, more importantly, the operating cost added by the additional ISNs. For example, to scale from serving an index of size 1× to 2× with the same hardware, we need to double the number of index serving machines.
The query cost will double even for queries whose best results (highly relevant documents) are in the smaller index and do not need the larger one. The query cost increases because each query that misses in the cache will use twice the number of machines after scaling. Search engines typically mitigate this by maintaining indexes of different sizes; that optimization is out of the scope of this thesis.

2.1.4 Document Ranking Overview

Query processing involves two main steps [45], described in Figure 2.3. The first step is document selection, which uses a fast first-pass ranker. This step matches the indexed documents against the search query terms; in other words, it finds the set of documents that contain the query terms. This step works on a large number of documents: its input is the inverted index partition on the selection ISN and the query terms, and its output is a set of documents that contain all or some of the search query terms. The index generator compresses the inverted index [49] and stores it on disk/SSD.

[Figure 2.3: Query processing main steps.]

At runtime, the index manager loads parts of the index from disk/SSD, decompresses them, and caches them in memory. Interestingly, the index cache hit rate can be up to 90% even if memory holds only 20% of the inverted index at any given time [8]. The inverted index format maps words to the documents that contain them, which makes it easy and fast to find the set of documents containing the query words. In this fast first-pass ranking phase, the index server decompresses the index, finds the list of documents for each query word, performs list intersection, and runs a simple scoring ranker on the resulting documents.

For example, a search for "Lazy Dog" in the example above will return documents 12 and 18 after the list intersection.
It will then run a simple ranker on each document to give it a score. This simple scoring ranker needs to be fast in the current architecture because it runs on a large number of documents per index partition. The simple ranker in this phase requires some document features to be extracted (retrieved and/or calculated). Document features come in two types: static and dynamic.

Static document features can simply be retrieved from the index (like the static rank, or PageRank [32], which reflects the quality of the document). Dynamic document features, in contrast, require reading more data from the document and running some calculations; examples are the number of word occurrences, BM25 [35], and other features.

The output of this fast first-pass ranking is a set of documents, selected from the local ISN index partition, that could potentially be a good match for the query. There are many techniques for inverted index compression/decompression and list intersection [49]. In our system, we run this process on the CPU. This output set of documents is passed to the next step, the expensive second-stage ranking.

[Figure 2.4: Document ranking steps. The yellow boxes are the GPU modules.]

The second step is document ranking, which uses an expensive second-stage ranker. This step works on a smaller number of documents per query. The second-stage expensive ranking phase extracts more document features (static and dynamic), uses them to rank the documents, sorts them by that rank, and returns the top N documents to the aggregator.

Figure 2.4 shows the document ranking steps: the data flow and modules of the first- or second-stage rankers. The difference between the first-stage and second-stage rankers is their complexity.
A ranker's complexity is measured by the number of features it needs to extract from the document and by the size of the ranker's machine learning model. Ranking one document involves three steps.

First, document feature extraction (FE). The input to this module is the intersection of the search query terms and the document to be ranked. The output of this stage is a feature vector: a vector containing the values of all the document features needed to evaluate the machine learning model described under model evaluation below.

The second step is ranking model input transformation (MIT). This step transforms the feature vector V generated in the FE step. These transformations are one-to-one or many-to-one, meaning one or more features from the original feature vector V are involved in calculating one feature of the transformed feature vector V'.

The third step is ranking model evaluation (ME). This step takes the transformed feature vector V' and evaluates a machine learning model. The output of this step is one final number that represents the document's rank, or score. In this context, a document with a higher score is a better result for the user because it is more relevant to their search query. As part of this thesis, feature extraction, model input transformation, and ranking model evaluation were ported to the GPU. They are described in the next sections.

Feature Extraction (FE)

In contemporary web search engines, many types of features are extracted from each document that gets ranked.

First, "static features" are assigned by the crawler and stored in the document metadata. Static document features are fixed per document and do not depend on the query.
Using these static features for ranking only requires copying their values from the document data to the feature vector; no processing or calculation is required.

Second, "dynamic features" depend on the query terms: for a given document, a feature's value differs for different query terms. Hence, dynamic feature extraction requires online processing. An example of a dynamic feature is the number of query word occurrences in the document.

Finally, "dependent features" are combinations of other static and dynamic features, e.g., BM25F. The BM25 feature is a document ranking feature that uses the number of query term occurrences in the document's stream and the size of the document's stream. BM25 uses these two features to calculate the adjusted term frequency for each word and the inverse document frequency (IDF), then applies a formula to calculate the BM25 feature value. Equation (2.1) shows the standard form of the BM25 formula, where f(q, D) is the number of occurrences of term q in document D, |D| is the document length, avgdl is the average document length in the index, and k1 and b are tuning constants:

BM25(D, Q) = Σ_{q ∈ Q} IDF(q) × f(q, D) × (k1 + 1) / (f(q, D) + k1 × (1 − b + b × |D| / avgdl))   (2.1)

These dependent features have to be evaluated after the dynamic feature evaluation.

There are hundreds of different dynamic features involved in ranking each document. The dynamic feature extraction logic is represented by a finite state machine (FSM). Figure 2.5 describes the general flow of feature extraction over the input document.

[Figure 2.5: Dynamic feature extraction FSM.]

The document is fed to the FSM. The FSM extracts the document header, then the stream header, and finally the phrases and words. Most of the computation happens in the extract-word state. The computation aggregation and the feature value publishing happen in the subsequent states: the publish-stream-feature and publish-document-feature states.
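As a minimal illustration of dynamic feature extraction in the spirit of the FSM above (a deliberately simplified sketch, far smaller than the real extractor), the following scans a document's word-id stream once and publishes occurrence features for one query word:

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>
#include <cassert>

// Sketch: one pass over the word stream publishes three features for a
// query word -- occurrence count, first position, and last position --
// mirroring the extract-word / publish-document-feature states.
struct OccurrenceFeatures {
    uint32_t count = 0;
    int32_t  first = -1;  // -1 means the word does not occur
    int32_t  last  = -1;
};

OccurrenceFeatures extract_occurrences(const std::vector<uint32_t>& doc_words,
                                       uint32_t query_word) {
    OccurrenceFeatures f;
    for (std::size_t pos = 0; pos < doc_words.size(); ++pos) {
        if (doc_words[pos] == query_word) {  // "extract word" state
            if (f.count == 0) f.first = static_cast<int32_t>(pos);
            f.last = static_cast<int32_t>(pos);
            ++f.count;
        }
    }
    return f;  // "publish document feature" state
}
```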
Each FSM can publish multiple different features. For example, the "Word Occurrence" FSM publishes the number-of-occurrences feature as well as the first and last occurrence features. The size of the feature output is variable: some features output a scalar value and other features output one- or two-dimensional arrays.

Section 3.3 describes our implementation of feature extraction on the GPU.

Model Inputs Transformation MIT

Model inputs transformation is a transformation on the feature vector from V to V'. The number of elements in the transformed feature vector V' differs from the number of items in the original feature vector V, which has implications for the implementation on the GPU, as described later. These transformations are categorized into two types: single feature transformations and multiple features transformations.

A single feature transformation allows the ranker designer to scale or clamp the feature input value. The memory access pattern, in this case, is regular: each V[i] is transformed into V'[i].

Examples of these transforms:

• Linear: The equation for a linear transformation is:

output = input × slope + intercept (2.2)

• Log Linear: The equation for a log linear transformation is:

output = log(input + 1) × slope + intercept (2.3)

• Bucket: The equation for a bucket transformation is:

output = 1 if LowerLimit ≤ input < UpperLimit, output = 0 otherwise (2.4)

Multiple features transformations allow the ranker designer to combine multiple features together. Examples of these transforms:

• Mathematical Expressions (FreeForm): The equation for the mathematical expressions transformation is:

output = Function(input1, input2, ..., inputn) (2.5)

• Decision Trees: Decision trees are unbalanced binary trees; each tree has a topology, internal nodes with thresholds, and leaf nodes with output values. Decision tree models are trained offline and are ready to be used.
The inputs to the decision tree are the extracted features, and the output is a single number that represents the tree evaluation score. Figure 2.6 shows a simple three-layer decision tree with three internal nodes (I1, I2, and I3) and four leaf nodes (O1, O2, O3, O4). A real ranker decision tree may have hundreds to thousands of nodes. The decision tree evaluation starts at the root I1. Feature A is compared against threshold1. If the value of feature A is greater than threshold1, the next node is I2; otherwise the next node is I3. The process keeps repeating until it reaches a leaf node. When this happens, the leaf node output value is returned as the result of this tree evaluation. The access pattern to the features in the feature vector and the choice of the next tree node are irregular in the tree evaluation. Some methods can be applied to overcome the branch and memory divergence; these methods will be discussed in the GPU implementation.

Figure 2.6: Simple three-layer decision tree (the root I1 compares Feature A against Threshold 1; its yes/no children I2 and I3 compare Feature B against Threshold 2 and Feature C against Threshold 3; the leaves are O1, O2, O3, O4).

Figure 2.7: Simple two-layer neural net (input layer I(0)..I(N), weights W(j, i) between the input layer and the hidden layer H(0)..H(N), and a single output).

Model Evaluation ME

The machine learning models that are used in information retrieval are neural nets, decision trees, and mixes of multiple models, which are called ensembles. Three steps are needed to use a machine learning model: first, designing the model; second, training the model; third, executing or evaluating the model. Designing and training the model happen offline by the relevance team, and the output of this training is the model description and the weights/thresholds. The trained models are typically considered the search engine's "secret sauce" (i.e., the source of competitive advantage).
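Before moving to the evaluation step, the tree walk described above can be sketched in plain C++ (an illustrative host-side version with a hypothetical node layout; Chapter 3 presents the layout actually used on the GPU, where a leaf is marked by a feature index of -1):

```cpp
#include <cassert>
#include <vector>

// One node of an unbalanced binary decision tree. A node with
// featureIndex == -1 is a leaf, and 'threshold' then holds its output value.
struct TreeNode {
    int featureIndex;  // which feature to compare, or -1 for a leaf
    float threshold;   // comparison threshold, or the leaf output value
    int left;          // index of the "yes" child; the "no" child is left + 1
};

float EvaluateTree(const std::vector<TreeNode>& nodes,
                   const std::vector<float>& features)
{
    int i = 0; // start at the root
    while (nodes[i].featureIndex != -1) {
        // Greater-than comparison as in Figure 2.6: "Feature > Threshold".
        i = (features[nodes[i].featureIndex] > nodes[i].threshold)
                ? nodes[i].left       // yes branch
                : nodes[i].left + 1;  // no branch
    }
    return nodes[i].threshold; // leaf output
}
```

Note how both the feature lookup (`features[nodes[i].featureIndex]`) and the choice of the next node are data-dependent; this is the irregular access pattern discussed above.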
Model evaluation is the final step that calculates the document rank, and it is an online process. We chose neural nets that aggregate the features and the decision tree outputs.

Figure 2.7 shows an example of a two-layer neural net. The input is an array I of N elements. The first layer is layer H, which has N nodes. There are N × N weights between the inputs and the first layer. The value at any node H(i) is the sum of products of all the inputs:

H(i) = Σ_{j=0}^{N} I(j) × W(j, i) (2.6)

Each layer is the sum of products of the previous layer. The final layer is just one neuron.

2.2 GPUs

GPUs were introduced in 1999 for the PC industry. They were initially developed to handle computer graphics computation and rendering. The nature of graphics algorithms is multiple independent operations with high parallelism. Each pixel that appears on the computer monitor requires transformation, lighting, triangle setup/clipping, and rendering. Over the years, computer graphics and games became more sophisticated and detailed. This advancement required faster processing speed, which led to tremendous advancement in GPUs. Currently, two main vendors produce these graphics processors: NVIDIA and AMD.

Modern graphics processing units (GPUs) are organized around multithreaded SIMD function units. Typically, an application starts on the CPU and launches a data-parallel kernel onto the GPU. The kernel consists of a hierarchy of scalar threads organized into small groups of 32-64 scalar threads, called a warp in NVIDIA terminology, that execute in lock-step on the SIMD hardware. Divergent control flow within a warp is supported through various hardware mechanisms (e.g., predication and related approaches). Tens of warps are grouped into a thread block and can communicate via a fast on-chip scratchpad memory (called "shared memory" in NVIDIA terminology). The warps in a thread block are active at the same time and fine-grain interleaved upon a single SIMD core.
A small number of thread blocks from the same kernel may be scheduled at once on the same SIMD core. Typical GPUs contain tens of SIMD cores, so a hundred thread blocks and tens of thousands of scalar threads can be executing concurrently.

GPU vendors provide APIs to allow programmers to write general purpose applications on them (GPGPU). There are a few programming languages that can be used to write general purpose applications targeted to run on GPUs; examples are CUDA [28] from NVIDIA and OpenCL [12] from AMD. Recently, Microsoft announced C++ Accelerated Massive Parallelism (C++ AMP) [22], which is a C++ language extension for developing applications on data-parallel hardware such as GPUs. Microsoft ships C++ AMP in the Visual Studio 2012 IDE, which contains the VC11 compiler. The Intel LLVM ShevlinPark project [37] reports: "Our (subjective) experience: Writing data parallel code using C++ AMP is highly productive"; we confirm this observation in our experience.

C++ AMP is an open specification that can be implemented by any vendor. Microsoft implements C++ AMP on top of its DirectCompute and the High-Level Shader Language (HLSL). There are many features that make C++ AMP an excellent choice and improve programming productivity. Some of these features are:

• C++ functions can be marked with the keyword "restrict" to run on the CPU, the GPU, or both.
• The kernel starts at a parallel_for_each library call that is very close to the parallel_for from the MS PPL library.
• The use of lambda expressions and function objects to write the GPU kernel makes it easier and faster to compose and reuse code.
• The use of "tile_static" variables and the synchronization "tile_barrier::wait" allow for easy use of the shared memory on the GPU.
• C++ AMP comes with great libraries such as "Concurrency::fast_math", "Concurrency::precise_math", and "Concurrency::amp_algorithms".
• Using "array" and "array_view" objects to move data between the CPU and the GPU easily and transparently to the developer. Also, no change is needed to the code when running on an integrated CPU-GPU on the same chip, where a data copy is not required.
• Using "accelerator" and "accelerator_view" objects to decide where to execute the code makes it easy to switch between the GPU hardware and the emulator.
• C++ AMP comes with excellent tools, including a software emulator for debugging.

GPUs such as NVIDIA's Kepler and Maxwell [3, 29] have been shown to increase throughput by up to an order of magnitude versus well-optimized multicore CPU code [19]. This is possible because they dedicate more area to computation rather than complex instruction scheduling hardware and cache. Furthermore, the latest NVIDIA family of GPUs, Maxwell [3], is designed with power efficiency as its top priority. The Maxwell architecture gives 2X performance per watt compared to its predecessor.

2.3 Summary

In this chapter, we provided the necessary background on web search engines and GPUs. We showed that web search engine query processing requires three main computation steps: feature extraction (FE), model input transformation (MIT), and model evaluation (ME). In the following sections, we will also refer to the FE and MIT combined as model input evaluation (MIE). The next chapter talks about our implementation, followed by the experimentation and the results chapters.

Chapter 3
Web Search on GPUs

As described in Chapter 2, the query processing workflow consists of two main phases: document selection and document ranking. Both phases involve the following steps: reading the index (ISR), decompression, feature extraction (FE), model input transformation (MIT), and model evaluation (ME). We will discuss the decision of offloading modules to the GPUs, and then we will describe the design and optimization for offloading each module. Next, we will talk about GPU memory management.
Finally, we will talk about the deployment to production and failure handling.

3.1 Determining What to Offload to the GPU

We started by profiling the index server service, which showed that a significant share of the CPU time is spent on feature extraction, model input transformation, and model evaluation.

FE consists of multiple calculations. On average, each feature performs one ALU operation per word in the requested document. The CPU performs tens to hundreds of thousands of additions and multiplications per query to extract the document features. This amount of ALU operations makes FE an ALU-bound process. The MIT includes two operations per feature for the single feature transformations and multiple compare operations per multi-feature transformation. The ME consists of many addition and multiplication operations based on the model size. These steps are computation bound and not I/O bound, and they seem like a good candidate to be offloaded to an accelerator, as previous work tried accelerating similar modules [33].

We decided to offload the MIT and ME computation first because it has a high computation demand with less memory access. In this thesis, we also evaluate offloading FE.

For ranking model selection, we cover a large number of ranking models, including neural nets with decision tree inputs. For feature extraction, we cover 125 features and some dependent features like the BM25 family of features.
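For reference, the simple BM25 form behind this family of features can be sketched as follows (a host-side illustration using the standard k1/b parameterization with illustrative default values; the production ranker's exact parameters and the normalized BM25F variant are not shown):

```cpp
#include <cassert>
#include <cmath>

// BM25 contribution of one query term against one document stream.
// tf: term occurrences in the stream, len: stream length in words,
// avgLen: average stream length over the index, idf: the term's IDF.
// k1 and b are tuning constants (defaults here are illustrative).
float Bm25Term(float tf, float len, float avgLen, float idf,
               float k1 = 1.2f, float b = 0.75f)
{
    // Length normalization of the raw term frequency.
    float norm = k1 * (1.0f - b + b * (len / avgLen));
    // Adjusted term frequency weighted by the term's IDF.
    return idf * (tf * (k1 + 1.0f)) / (tf + norm);
}
```

The full document score is the sum of this contribution over the query terms; note that it only needs the per-term occurrence counts and the stream length, the two features the text identifies as inputs.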
Figure 3.1: GPU implementation overview (the Request Manager on the CPU feeds multiple document streams to FE-GPU and FE-CPU; the GPU feature vector and the CPU feature vector feed MIT-GPU, which produces the model inputs for ME-GPU, which in turn produces the document scores).

Any other features or models that are not currently supported in the GPU implementation are executed on the CPU, and the result is forwarded to the GPU when needed.

In the following sections, we describe the challenges of offloading the computation of each of these steps onto the GPU, and our optimizations and implementation.

3.2 GPU Design Overview

Figure 3.1 shows the overview of our implementation. The CPU sends the requests to the GPU feature extraction (FE-GPU). At the same time, the CPU does the CPU feature extraction for other features (FE-CPU). Then the CPU sends the software-evaluated features ("CPU-FeatureVector") to the GPU model input transformation (MIT-GPU). The GPU executes the FE-GPU and then forwards the resulting "GPU-FeatureVector" to the MIT-GPU. Next, the GPU executes the MIT-GPU and forwards the results to ME-GPU. Next, the GPU executes the ME-GPU. Finally, the CPU pulls the document scores from the GPU memory.

DocId: 123
StreamId: 1
Phrase: 1 Count: 1 Length: 10
word: 0 offset 10
word: 0 offset 12
word: 1 offset 13
... and so on for the other streams and phrases.

Figure 3.2: Example FE request.

3.3 Feature Extraction on GPU

Feature extraction (see Section 2.1.4) on GPUs is challenging because the FE logic is represented by a finite state machine (FSM). It is hard to parallelize the FSM execution because the execution pattern is serial. Partitioning the FSM input and creating a thread per input partition does not provide a performance benefit. This poor performance happens because of the inherent dependencies in execution. For example, the second thread's start state is the first thread's end state. The second thread has to wait for the first thread to finish, which leads to serializing the work.
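The dependency can be made concrete with a toy two-state scanner (hypothetical '[' and ']' stream markers, not the real request format): the state in which one chunk of input ends is the state in which the next chunk must begin, so the partitions cannot run concurrently.

```cpp
#include <cassert>
#include <string>

// Toy FSM: the scanner is either outside or inside a bracketed "stream".
enum class State { Outside, Inside };

// Scan one chunk of the input, starting from a given state, counting the
// 'w' characters seen while inside a stream. Returns the end state, which
// is exactly what the next chunk needs as its start state.
State Scan(const std::string& chunk, State start, int& wordsInside)
{
    State s = start;
    for (char c : chunk) {
        if (c == '[')      s = State::Inside;
        else if (c == ']') s = State::Outside;
        else if (s == State::Inside && c == 'w') ++wordsInside;
    }
    return s;
}
```

Splitting "[ww][w" and "w]" between two threads illustrates the problem: the second chunk counts correctly only when started from the first chunk's end state, so thread 2 must wait for thread 1.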
We overcome some of these dependency issues by using the observation that the document is divided into streams and most of the features operate on each stream independently.

We will first describe our FE-GPU request format, and then the FE optimizations that we implemented. Each document consists of multiple document streams. Moreover, each stream consists of multiple phrases, and each phrase includes multiple word hits. Word hits indicate where the user's keywords appeared in the phrase. The query and document pair arrive at the Request Manager. The Request Manager opens the ISR and creates the FE requests on the CPU. The FE request contains the hit vector, which indicates where the query terms appear in the document.

An example of a document request: Query: "Lazy Dog" (Word:0 is "Lazy" and Word:1 is "Dog"). The GPU-FE request to extract the features for document identifier 123 is shown in Figure 3.2.

This request means: the document identifier is 123; StreamId:1 is one stream of the document, which could be, for example, the body section of that web page; Phrase:1 is one phrase in the document.

Figure 3.3: GPU FE serial request format (Doc Header | Stream Count | Stream Header | Phrase Count | Phrase Header | Word Count | Word1 Word2 ...).

Word:0 "Lazy" appeared at location 10 in the first phrase.
Word:0 "Lazy" appeared at location 12 in the first phrase.
Word:1 "Dog" appeared at location 13 in the first phrase.
And so on.

The first, un-optimized serial approach is to send one document at a time to the GPU. The data structure for this single document is a one-dimensional array. The GPU has a single thread that loops over the input document to evaluate each feature. This approach is the GPU baseline. There are two types of data that we need to move to the GPU memory: the ranking-model-specific data and the query/document-specific data. The ranking-model-specific data for FE can be moved once per ranking model GPU load; more on that later in the section on GPU model memory management.
The query/document-specific data have to be moved per query. In the case of FE, the data sent to the GPU is the query/document request.

The factors that affect the execution time for one document are:
• Time to copy the document
• Time to launch the kernel
• The execution time for each feature multiplied by the size of the document

The request format for the GPU baseline is described in Figure 3.3. We send the document header followed by the number of streams. Next is the stream header followed by the number of phrases. Next is the phrase header followed by the number of words. Finally comes the list of the words.

The next optimizations exploit a higher degree of parallelism: they change the memory access pattern of ranking one request, and they combine multiple requests to gain more parallelism.

The first optimization, which is specific to the ranking problem, leverages the observation that a single document contains multiple streams, and each stream is independent with respect to many features. Figure 3.4 shows the request format for this optimization.

Figure 3.4: GPU FE request format (parallel): Doc Header | Stream Header 1 | Phrase Count | Phrase Header | Word Count | Words ... | Stream Header 2 | ... | Stream Header 3 | ...

In this optimization, we parallelize the document based on the number of streams that it contains. When we execute the feature extraction on the GPU, we launch one thread per stream. Some streams are much larger than other streams, so in this approach the latency of the request depends on the largest stream. One way to increase the GPU utilization is to combine two or more small streams together when their total size is smaller than the largest stream.

The second optimization is executing each feature in a different warp to allow multiple features to execute at the same time.
This method can reduce the GPU utilization because some warps have unused threads, but it reduces the request latency. The warp size is 32, and we make each group of 32 threads handle the execution of one feature. Furthermore, by combining the first and the second optimization, each group of 32 threads executes multiple streams for one particular feature.

The obvious final optimization is batching multiple requests to the GPU, where each batch contains a group of documents to be ranked together. This approach requires the introduction of a request queue. Multiple software worker threads fill the queue, and one software thread reads the queue periodically (based on a threshold), batches the requests, and sends them to the GPU. This approach allows us to fall back to CPU-only execution if the number of requests per period is not enough for the GPU to provide a benefit. The fallback to the CPU can also be used in the case of GPU hardware or software failure, as we will discuss in the failure handling section. In the results chapter, we will show the number of documents beyond which the GPU implementation is better than the CPU; this number depends on the ranker complexity, as we will explain there.

3.3.1 Dependent Feature Extraction

Some features depend on other features being calculated first. As an example of such a feature, we implemented the well-known BM25F [35] ranking function. BM25F features are dynamic document ranking functions; they rank the document relative to the query terms. The simple form of this function uses the number of query term occurrences in the document's stream (which is another feature) and the size of the document's stream to calculate the score. Our implementation of BM25F uses a two-dimensional GPU array. The first dimension is the number of query words; the second dimension is the number of documents to be ranked. For each query word, we calculate the adjusted term frequency for that word and the inverse document frequency (IDF).
In the next step, we use the Word0 thread (the thread that handles the computation of the first query word) for each document to aggregate the data, sum the IDF, and calculate the initial BM25 sum. Finally, we scale the BM25F and calculate the normalized version too.

Once the document features are generated in a "feature vector" form, the ranker performs a transformation on this feature vector, called the model input transformation.

3.4 Model Input Transformation

After evaluating FE-GPU and FE-CPU and copying the CPU feature vector to the GPU memory, the resulting GPU feature vector and the copied CPU feature vector are combined in the GPU memory. The result is the document feature vector V, which is the input to the second step, model input transformation.

In this phase, a feature vector V is transformed to V'. The single and multiple feature transformations are challenging to offload to the GPU because of their memory and control divergence. The memory access patterns are irregular, especially for the multiple feature transformations, because each element in V' is a different transformation and gets calculated using multiple distant values from the original feature vector V. Next, we will describe our data structure design, the transformation implementation, and our proposed optimizations.

The data structure of the MIT contains the transformation type, the transformation calculation parameters, and the index of the feature in the feature vector. The data structure is described below.

The transform types are shown in Figure 3.5. These transformations are the single feature transformations (Linear, LogLinear, and Bucket) and the multiple features transformation (DecisionTree).
enum GPUTransformType
{
    Dummy,
    Linear,
    LogLinear,
    Bucket,
    DecisionTree
};

Figure 3.5: Model input transformation types.

struct GPUModelInputTransformation
{
    GPUTransformType m_transform;
    unsigned int m_featureIndex;
    float m_param1;
    float m_param2;
};

Figure 3.6: Model input transformation structure.

The Dummy transformation is used for padding in the warp optimization that will be described later.

For each transformation in the final feature vector, there is one structure that represents it, as in Figure 3.6. m_transform specifies the transformation type. m_featureIndex specifies the index of the feature that needs to be transformed. The transformation parameters are m_param1 and m_param2. These two parameters have a different interpretation based on the transform type. Table 3.1 shows their meaning for the various transformation types.

Transformation Type | Param1     | Param2     | FeatureIndex
Dummy               | N/A        | N/A        | N/A
Linear              | Slope      | Intercept  | Input feature's index
LogLinear           | Slope      | Intercept  | Input feature's index
Bucket              | LowerLimit | UpperLimit | Input feature's index
DecisionTree        | N/A        | N/A        | Index into tree topology array

Table 3.1: MIT Param1, Param2, and FeatureIndex meaning per transformation type.

Decision tree transformations are handled differently. For decision trees, the FeatureIndex holds an offset into the tree topology structure shown in Figure 3.7.

struct GPUTreeNode
{
    unsigned int m_treeFeatureIndex;
    unsigned int m_treeNodeLTE;
    float m_treeNodeThreshold;
};

Figure 3.7: TreeNode structure.

for (unsigned int i = 0; i < numberOfInputs; ++i)
{
    float featureValue = p_input[m_GPUMIT[i].m_featureIndex];
    switch (m_GPUMIT[i].m_transformationType)
    {
    case TransformationType::Linear:
        OutputFeatureVector[i] = (featureValue *
            m_neuralInputParameters[i].m_param1) +
            m_neuralInputParameters[i].m_param2;
        break;
    case TransformationType::LogLinear:
        ...
    }
}

Figure 3.8: MIT unoptimized.
Each node in the tree has one GPUTreeNode struct. In this struct, m_treeNodeThreshold is the node threshold. m_treeFeatureIndex is the index of the feature that will be compared against m_treeNodeThreshold. m_treeNodeLTE is the index of the left child node, and the index of the right child node is m_treeNodeLTE + 1.

The simple GPU implementation is a serial implementation: for each element in the vector V', it calculates the transformation. The code in Figure 3.8 shows this initial implementation.

The first optimization is to launch a thread per transformation.

The second optimization is to rewrite the ranker description file, reordering the transformations to group similar transformations together. The ranker rewriter analyzes the input ranker description file and builds lists of transformation groups. It uses these groups of transformations to rewrite the file so that similar transformation types appear adjacent to each other. It also has to swap the model weights to reflect the input reordering.

The third optimization is to add padding using TransformationType::Dummy to make each warp of 32 threads execute one type of transformation. This allows multiple transformations to be executed at the same time.

The fourth optimization is to make each transformation group of threads from V' use features that are close together in the input feature vector V. To achieve high memory access bandwidth, the data accessed by different threads in a warp should be nearly adjacent so that memory coalescing can be more effective. We achieve this by modifying the feature map.

Figure 3.9: Model inputs reference features in the feature vector using the feature map (each model input I(0)..I(N) looks up feature names FN(0)..FN(m) through the feature map to find the index of the computed feature value FV(0)..FV(F) in the feature vector; each input can reference multiple features).
The feature map is the object that assigns the feature indexes in the feature vector. It works as follows: each model input references one or more document features by asking the feature map for their index in the feature vector. Figure 3.9 shows model inputs Ii accessing feature values FVi using the feature map's name-to-index mapping from FNx to FIy. The feature map is important in achieving high memory access bandwidth because it exploits spatial locality, e.g., adjacent model input threads access adjacent feature values in the feature vector.

parallel_for_each(gpuAccelerator_view, GPUEvaluatedInputs.extent,
    [&](index<2> idx) restrict(amp)
{
    int documentNumber = idx[0];
    int inputNumber = idx[1];
    float slope = GPUMIT(inputNumber).m_param1;
    float intercept = GPUMIT(inputNumber).m_param2;
    GPUTransformType transformationType =
        GPUMIT(inputNumber).m_transform;
    UINT32 featureIndex =
        GPUMIT(inputNumber).m_featureIndex;
    if (transformationType == TransformationType::Linear)
    {
        float featureValue = (float) GPUFeatureVector(documentNumber, featureIndex);
        GPUEvaluatedInputs(documentNumber, inputNumber) =
            (featureValue * slope) + intercept;
    }
    ...
});

Figure 3.10: MIT optimized.

The last optimization is to batch a group of documents together. Our implementation utilizes a thread per final transformation. The code in Figure 3.10 shows the final optimized code.

The data copy from the CPU to the GPU is critical for performance. We experimented with different precisions (single vs. double precision) and achieved very close numbers compared to the CPU-only ranking, and better than fixed-point computations. The data copied from main memory to the GPU includes:
1. Moving the CPU feature vectors to GPU memory: The size of the model inputs is N, and the feature vector size for each document is F, where N ≠ F because multiple inputs can reference the same document feature, and some inputs can reference multiple document features. When the GPU ranker evaluates D documents, we have to move D × F × 4 bytes (feature values) from main memory to the GPU memory. This data movement happens each time a batch of documents needs to be ranked.

2. Moving the ranker input definitions to the GPU memory: The layout of the in-memory data structure that represents the ranker is critical because it changes the memory access pattern when evaluating the rank of each document. We experimented with different layouts, e.g., an array of structures vs. a structure of arrays [11]. We have to copy the model input transformation for each model input. We use a common structure, shown in Figure 3.6, for the single and multiple feature model inputs, but with different field meanings. Alternatively, we tried a structure that holds arrays of m_transform, m_featureIndex, m_param1, and m_param2.

All the tree data are moved to the GPU in a vector of GPUTreeNode, shown in Figure 3.7. For each tree input, the m_featureIndex in the GPUModelInputTransformation structure carries the offset of the root tree node in the GPUTreeNode vector. The field m_treeFeatureIndex, on line 3, is the feature index for the feature that this tree node needs to check against its threshold. The field m_treeNodeLTE, on line 4, represents the tree topology by giving the index of the left child node.
When a tree node checks the feature against its own threshold: if the feature value is less than or equal to the threshold, the next tree node index in the GPUTreeNode vector is m_treeNodeLTE + offset; otherwise it is m_treeNodeLTE + 1 + offset. The field m_treeNodeThreshold, on line 5, carries the tree node threshold; if m_treeFeatureIndex is -1, then this node is a leaf node, and we use m_treeNodeThreshold to store the leaf node output value.

The model input definition data is moved to the GPU once per model load.

3.5 Model Evaluation

As we described before, document ranking uses two well-known machine learning algorithms: neural networks and decision trees. The ranking model combines the algorithms together. In the following section, we will focus on neural networks that have some of their neural inputs made as decision trees.

Figure 3.11: Modeling neural network evaluation as matrix multiplication for a single document (the 1 × N input vector I(0)..I(N) multiplied by the N × N weight matrix W gives the hidden layer values H(0)..H(N)).
Depending on the size of each layer of theneural net, this implementation can be very expensive compared to theCPU version. The larger the neural net layer size, the more efficient is theGPU implementation. To get the benefit of the massive parallelism in theGPU hardware, we rank multiple documents at the same time.We expand the first approach to evaluating the neural network model formultiple documents in parallel. Figure 3.12 shows evaluating D documentson the neural network in Figure 2.7. The input is a matrix of size DxN neuralinputs, and the output is a matrix of size DxN that represent the outputof the first layer. After evaluating the last layer of the neural network, theresult is an array of size D that contains the ranks of the documents. Thisarray get sent back to the CPU. The code in figure 3.13 shows the neural313.5. Model EvaluationW(0,0)XW(1,0)W(N,0)=W(0,1)W(1,1)W(N,1)W(0,N)W(1,N)W(N,N)H(0,0)H(0,1)H(0,N)Input Layer for multiple documentsWeights between Input Layer and Hidden LayerHidden Layer for multiple documentsI(0,0) I(0,1) I(0,N)I(1,0) I(1,1) I(1,N)I(D ,0) I(D ,1) I(D ,N)H(1,0)H(1,1)H(1,N)H(D ,0)H(D ,1)H(D ,N)Figure 3.12: Modeling neural network evaluation as matrix multiplicationfor multiple documents.p a r a l l e l f o r e a c h ( gpuAcce lerator v iew , GPUresultGPU . extent ,[& , numberOfInputs ] ( index<2> idx ) r e s t r i c t (amp){i n t documentNumber = idx [ 0 ] ;i n t nodeNumberinLayer0 = idx [ 1 ] ;f l o a t sum = GPUWeights ( nodeNumberinLayer0 , 0 ) ;f o r ( unsigned i n t i = 0 ; i<numberOfInputs ; ++i ){sum += GPUEvaluatedInputs ( documentNumber , i ) ∗GPUWeights ( nodeNumberinLayer0 , i +1);}GPUresultGPU( idx ) = sum ;} ) ;Figure 3.13: Neural Net evaluation of one layer of the for multiple docu-ments.323.6. 
We use the optimized version of the C++ AMP matrix multiplication [11], which utilizes the GPU shared memory for fast access to repeatedly accessed data.

3.6 GPU Models Memory Management

As described before, the data that we move to the GPU is of two types:

1. Model-dependent data. This data gets loaded once per model load, and it is reused for each query. The model-dependent data are the MIT data, including the tree topology and the neural network description data.

2. Model-independent data, which is the document/query data. This data is per query, and it has to be sent to the GPU for each query.

The neural network description data is the weight matrices; there is one weight matrix per neural network layer of the model. These matrices need to be moved to the GPU by the CPU before the evaluation of the model on the documents. A part of the GPU global memory is assigned to hold the model data. Multiple models can be loaded in the GPU memory at the same time. The CPU keeps track of the loaded models in the GPU memory and passes a pointer to the GPU kernel to evaluate the right model. If the required model is not loaded in the GPU memory, the CPU copies the data to the GPU with the first query that requires the "unloaded" model. If the model data is not loaded in the CPU memory, the CPU has to load the data from a ranker description file that exists on disk. The CPU loads these weight matrices and sends them to the GPU to be cached in the GPU memory. The weight matrices stay in the GPU memory until they are evicted/replaced by the CPU. The module that manages the GPU memory is called the GPU ranker cache manager, and it uses an LRU policy to manage the GPU memory.

3.7 Software Deployment and Failure Handling

The GPU-accelerated index serving software deployment is not much different from the non-accelerated one.
The introduction of new hardware (the GPU) in each server requires driver installation and upgrades, which can be handled like operating system (OS) updates.

For failure handling, each cluster of ISNs should contain some machines with GPUs dedicated to replacing failed machines; these are called spare machines. In case of a software failure, the existing failure-handling mechanism should kick in to restart or reimage the machine. In the event of a GPU hardware failure, a decision has to be made: one option is to fall back to the CPU-only software; another is to declare the machine failed and use a machine from the spare pool.

For production, some infrastructure has to be built. This infrastructure includes performance counters, health monitors, alerts, and reports for the GPU.

3.8 Summary

In this chapter, we described the challenges of accelerating the query processing steps (FE, MIT, and ME) on the GPU. We then described our design and how we overcame some of these challenges. We described in detail our implementation of the FE, MIT, and ME steps and listed the optimizations that we applied to each of them. Some of these optimizations are specific to the index serving problem, while others are general and can be used in other GPU acceleration applications. We concluded the chapter by discussing the handling of multiple rankers in the GPU memory, the software deployment, and the failure handling.

The next chapter describes our experimental methodology, followed by our experimental results.

Chapter 4

Experimental Methodology

We implemented the GPU acceleration in a commercial web search engine. All the experiments that we ran on this search engine use real production data. We used the current system as the baseline CPU-only implementation; this baseline is a highly optimized production implementation.
We compared the baseline to our CPU-GPU implementation. Execution time is measured using the Microsoft Windows high-resolution performance timer API [26]. The GPU implementation is written using Microsoft C++ AMP, the VC12 C++ compiler, and Visual Studio 2013. See Table 4.1 for details of the hardware used during our evaluation.

We run a production query log obtained from a production machine, with one hundred thousand queries/documents per model.

For performance/execution-time experiments, we run the GPU kernel once first to allow the JIT to optimize the kernel; this happens only once per ranking-service start. After the JIT optimizes the kernel, we run each experiment 10 times, measuring the data copy to/from the GPU and the execution time of the CPU and the GPU with the Microsoft Windows high-resolution performance timer API [23]. We report the average of these runs.

For analysis experiments, we use the Visual Studio "Concurrency Visualizer" profiler [10]. To read the hardware counters, we use NVIDIA PerfKit 3.1 [30]. We instrument the driver and use the C++ AMP DirectX interop to obtain the ID3D11Device object (a device representation in DirectX). We feed the ID3D11Device to NVIDIA PerfKit, injecting code to read the hardware and driver counters.

For the power measurements, we use the GPU-Z tool [39] to read the hardware sensors and log the power data to a file at a sample rate of one per second. For more accurate measurements, we use the "Watts Up Pro" hardware tool [7] to measure the full system power, also at one sample per second. For the power experiments, we run a constant load of 10000 documents per query for 20 seconds, and we record the average power consumption.
CPU:
  Name                    Intel Xeon W3690 (Westmere-WS) @ 3.47GHz
  Technology              32 nm
  TDP Limit               130 Watts
  L1 Data cache           6 x 32 KBytes, 8-way set associative, 64-byte line size
  L1 Instruction cache    6 x 32 KBytes, 4-way set associative, 64-byte line size
  L2 cache                6 x 256 KBytes, 8-way set associative, 64-byte line size
  L3 cache                12 MBytes, 16-way set associative, 64-byte line size
  Memory Type             DDR3 Triple Channels
  Memory Size             24 GBytes
Chipset:
  Northbridge             Intel X58 rev. 13
  Southbridge             Intel 82801JR (ICH10R) rev. 00
  PCI-E Max Link Width    x16
GPU Kepler:
  Name                    NVIDIA GeForce GTX 660 Ti (GK104) @ 980MHz
  Technology              28 nm
  CUDA Cores              1344
  Memory size             2 GB @ 6.0Gbps
  Memory Interface        GDDR5
  Memory Interface Width  192-bit
  Memory Bandwidth        144.2 GB/sec
  Maximum TDP             150 Watts
GPU Maxwell:
  Name                    NVIDIA GeForce GTX 750 Ti
  Technology              28 nm
  CUDA Cores              640
  Memory size             2 GB @ 5.4Gbps
  Memory Interface        GDDR5
  Memory Interface Width  128-bit
  Memory Bandwidth        86.4 GB/sec
  Maximum TDP             60 Watts

Table 4.1: Hardware specs.

Ranker Name          R-S-BLG   R-M-BGD   R-L-D
Number of Inputs     125       515       1250
Number of Layers     1         1         2
Input Types:
  Bucket (B)         54        10        0
  Linear (L)         2         0         0
  Log Linear (G)     69        5         0
  Decision Tree (D)  0         500       1250

Table 4.2: R-S-BLG, R-M-BGD, R-L-D ranker details.

In the first power experiment, we measure the system idle power consumption with no GPU (around 70 watts). Then we run the CPU-only implementation for the three ranker types (described below), and we record the system power consumption. Finally, we subtract the idle system power from the experiment power to get the CPU-only power.

In the second power experiment, we add the GPU to the system, measure the full system power under constant load for the three rankers, and subtract the idle system power (with no GPU) from it.

For each experiment, we compare the documents' rank output of both the CPU and the CPU-GPU implementations for correctness testing.

The index server has a ranker store that contains production and experimentation rankers.
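The single-feature model input transformations listed in Table 4.2 come in three kinds: bucket (B), linear (L), and log-linear (G). The exact production formulas are not given here; the sketch below uses common textbook forms purely to illustrate the shape of this per-input work.

```cpp
#include <cmath>
#include <vector>

// Illustrative single-feature model-input transformations of the kinds
// listed in Table 4.2. These are assumed textbook forms, not the
// production formulas.
float linearTransform(float x, float a, float b) { return a * x + b; }

float logLinearTransform(float x, float a, float b) {
    return a * std::log(1.0f + x) + b;   // log(1 + x) keeps x = 0 well defined
}

// Bucket transform: map the raw feature to the index of the first
// boundary it falls below (boundaries sorted ascending).
int bucketTransform(float x, const std::vector<float>& boundaries) {
    int i = 0;
    while (i < (int)boundaries.size() && x >= boundaries[i]) ++i;
    return i;
}
```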
By analyzing and parsing a large number of both production and experimentation rankers, we decided to include three types of rankers based on their size. The three selected rankers represent a larger number of other, slightly different rankers. The actual production rankers change over time; this is a snapshot at the time of the experimentation. We will refer to them as R-S-BLG, R-M-BGD, and R-L-D. Table 4.2 shows these rankers' parameters: R-S-BLG is a simple ranker, R-M-BGD is an intermediate-complexity ranker, and R-L-D is a bigger and more complex ranker.

R-S-BLG represents the fast, small ranker with 125 model inputs; all 125 are single-feature model input transformations without any decision trees. These single-feature transformations are B: bucket, L: linear, and G: log-linear.

R-M-BGD is an intermediate-size ranker with 515 model inputs. 15 model inputs are simple single-feature model input transformations (B: bucket and G: log-linear). It also has 500 decision trees (D: decision tree); each of these decision trees has N requested features.

R-L-D is a large and complex ranker with 1250 model inputs, all of which are multiple-feature decision tree transformations (D: decision tree) without any single-feature model input transformations.

For each ranker, we run the single-thread CPU, parallel CPU, Kepler GPU, and Maxwell GPU implementations. The single-thread CPU is the current search engine implementation: each query runs on a single thread, and multiple queries run in parallel. We use 12 threads, and each thread ranks all the documents of its query serially. When a thread finishes ranking all the documents in a query, it gets a new query from the queries queue (a work-stealing queue). The parallel CPU is a multithreaded implementation; it is not used in the production code for reasons outside the scope of this thesis.
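The "parallel CPU" baseline splits the documents of a single query across worker threads. A minimal sketch follows; the thread count and static partitioning scheme are assumptions, and rankDocument stands in for the real scoring function.

```cpp
#include <thread>
#include <vector>

// Stand-in for the real per-document scoring function.
float rankDocument(float feature) { return feature * 2.0f; }

// Sketch of the "parallel CPU" baseline: the documents of ONE query are
// split across worker threads (12 in the thesis; configurable here),
// improving per-query latency without changing total throughput.
std::vector<float> rankQueryParallel(const std::vector<float>& docs,
                                     unsigned numThreads) {
    std::vector<float> scores(docs.size());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            // Static partitioning: thread t handles documents t, t+T, t+2T, ...
            for (size_t d = t; d < docs.size(); d += numThreads)
                scores[d] = rankDocument(docs[d]);
        });
    }
    for (auto& w : workers) w.join();
    return scores;
}
```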
We implemented it to compare against the GPU implementation. In this implementation, we use 12 threads per query: we pick a query from the queries queue and run the 12 threads to rank that query's documents in parallel. This improves the query latency but does not affect the CPU throughput. GPU-Kepler and GPU-Maxwell are the same implementation on different GPU hardware; this implementation is described in detail in Chapter 3. Please see the results chapter for comparisons.

Chapter 5

Experimental Results

Figure 5.1 shows the execution time for the simple R-S-BLG. We observe the following. First, ranking one document is faster on the CPU. The GPU implementation has overhead, which includes the time to copy data to the GPU, queue a GPU kernel, run the kernel, and copy the data back; this is much more overhead than simply running on the CPU. The tradeoff changes when ranking 1000 documents, where the total GPU execution time becomes better than the CPU's. In a real system, the selection phase that uses small rankers like R-S-BLG typically ranks more than 1000 documents; for 1000 documents and more, we see the benefit of the GPU-accelerated ranker.

Figure 5.2 shows details of the GPU execution time, divided into five categories. The first three categories from the bottom are for the data copies from and to the GPU: Copy-FV-MI means copying the feature vector and the model input transformations, Copy-M means copying the model data, and Copy-R means copying the resulting document rank scores back to the CPU memory. GPU-IE means the model input evaluation, and GPU-ME means the model evaluation itself. From Figure 5.2 we can see that the data copy can take up to 68% of the GPU execution time. This copying time can be avoided by using an integrated GPU like AMD's Fusion APUs [2], the evaluation of which is beyond the scope of this work.
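The crossover between the CPU and GPU curves in Figure 5.1 can be captured by a first-order latency model: the GPU pays a fixed per-query overhead (data copies, kernel launch) but has a lower per-document cost, so it wins once the batch is large enough. The constants below are illustrative only, not measured values from the thesis.

```cpp
#include <cmath>

// Illustrative first-order model of the CPU/GPU crossover: the CPU cost
// grows linearly from zero, while the GPU cost starts from a fixed
// per-query overhead but grows more slowly per document.
double cpuLatencyMs(int docs, double perDocCpuMs) {
    return docs * perDocCpuMs;
}

double gpuLatencyMs(int docs, double overheadMs, double perDocGpuMs) {
    return overheadMs + docs * perDocGpuMs;
}

// Smallest batch size for which the GPU path has lower latency.
int breakEvenDocs(double overheadMs, double perDocCpuMs, double perDocGpuMs) {
    return (int)std::ceil(overheadMs / (perDocCpuMs - perDocGpuMs));
}
```

With a hypothetical 1 ms overhead and per-document costs of 2 us (CPU) vs 1 us (GPU), the break-even point lands at 1000 documents, the same order of magnitude reported for R-S-BLG.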
Also, we notice that the model data copy "Copy-M" can be negligible at higher document counts, which suggests that we do not need to cache the models as we do on the CPU. A fast ranker like this one is already used to rank a high number of documents, and the GPU latency saving could be used to traverse more index documents, which leads to serving a bigger index.

Figure 5.3 shows the execution time for the intermediate-complexity R-M-BGD. It follows the same pattern as R-S-BLG, but the switching point where the GPU performs better than the CPU happens at a smaller number of documents: at 500 documents, the GPU latency starts to be better than the CPU's. This shows that as the ranker complexity increases, the CPU performs worse while the effect of the GPU's launch overhead diminishes at a smaller number of documents. This result suggests that the GPU is more beneficial when the ranking models are more complex.

Figure 5.1: R-S-BLG execution time on the CPU and the GPU.
Figure 5.2: Details of R-S-BLG normalized execution on the GPU.
Figure 5.3: R-M-BGD execution time on the CPU and the GPU.
Figure 5.4: Details of R-M-BGD normalized execution on the GPU.

From Figure 5.4 we can see that the data copy takes a lower percentage of the time and the kernel execution a higher one. This happens because decision tree evaluation requires more computation than a simple single-feature transformation. Furthermore, each tree introduces a very different memory access pattern than the single-feature model inputs: in single-feature model input evaluation, each input transformation thread requires a single feature from the feature vector, and these features are adjacent. A single tree, on the other hand, requires multiple features located at multiple, not necessarily adjacent, places in the feature vector. These reasons cause the GPU-IE component to be dominant.
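The scattered access pattern described above can be seen in a minimal decision-tree walk: every internal node reads a different, not necessarily adjacent, position of the feature vector. The node layout here is an assumption for illustration, not the production format.

```cpp
#include <vector>

// Minimal binary decision-tree evaluation, illustrating why tree inputs
// touch the feature vector at scattered positions: each internal node
// compares a different feature index.
struct TreeNode {
    int featureIndex;   // which feature this node tests (scattered across nodes)
    float threshold;
    int left, right;    // child indices; -1 marks a leaf
    float value;        // leaf output
};

float evalTree(const std::vector<TreeNode>& nodes,
               const std::vector<float>& features) {
    int i = 0;
    while (nodes[i].left != -1) {  // internal node: branch on its feature
        i = (features[nodes[i].featureIndex] < nodes[i].threshold)
                ? nodes[i].left
                : nodes[i].right;
    }
    return nodes[i].value;         // leaf reached
}
```

On a GPU, neighbouring threads evaluating different trees follow different branch paths and read different feature positions, which causes the control and memory divergence discussed above.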
The model in R-M-BGD is still simple, which is why GPU-ME is smaller.

Figure 5.5 shows the execution time for the complex R-L-D. It follows the same pattern as R-S-BLG and R-M-BGD, but again the switch point happens at an even smaller number of documents: at 50 documents, the GPU starts to achieve lower latency than the CPU. The data in Figure 5.6 shows that the trees dominate the execution time (GPU-IE). We reach a 19x improvement in latency for a complete batch on the GPU at 10,000 documents, because the GPU memory throughput is much higher than the CPU's. It is also important to observe that for the CPU version it was impossible to use such a complex model to rank more than 500 documents per query (70.61 ms), which is why such a ranker would be used in the "second-stage expensive ranker" phase. The GPU implementation, on the other hand, can use it to rank 10000 documents. In fact, the GPU implementation could give relevance developers and researchers an excellent tool to run more complex rankers in the "fast first-pass ranking" phase.

Figure 5.5: R-L-D execution time on the CPU and the GPU.
Figure 5.6: Details of R-L-D normalized execution on the GPU.

5.1 Power Consumption

Using GPU-Z, the idle power consumption of the NVIDIA GeForce GTX 660 Ti (Kepler) is around 12.9% of TDP (19.35 W), and that of the GTX 750 Ti (Maxwell) is around 2.4% of TDP (1.44 W). The full system idle power is around 67 W, measured using the "Watts Up Pro". Figure 5.7 shows the full-system average power consumption when ranking 10000 documents for the three ranking model types. The CPU power consumption compares worst against the GPU for R-S-BLG. This happens because this ranker has an
insignificant amount of memory and control divergence, so the GPU implementation and execution are very efficient. The bigger rankers with more decision trees suffer from code divergence, and the CPU power consumption is better than the Kepler GPU's for R-L-D. The Maxwell GPU power consumption is better than both the CPU's and the Kepler GPU's because Maxwell is optimized for power efficiency. This result shows the power efficiency gains that GPUs can bring to data center workloads such as the web search engine workload. Subtracting the system idle power consumption, the GPU can provide 2x the power efficiency of the CPU at 10000 documents per query for a mid-size ranker like R-M-BGD.

Figure 5.7: Average Power Consumption per Ranker Type.

Chapter 6

Related Work

Some recent work examined ways to parallelize FSMs [48]. Offline training for speculation is not efficient when the input data change. Prefix-sum parallelization uses many threads per data partition and starts from all possible states [18, 25]; this approach has scalability issues, e.g., the number of useless computations increases linearly with the number of FSM states. Principled speculation [47] uses a probabilistic "make-span" model (the profit of speculation) based on offline training.

Some work has been done to speed up DNNs for speech recognition using ARM and SSE [41, 42]. This work reduces the size of the model and converts floating point to fixed point. While the speedup is great, the quality could be affected. The same approach could also be used for search engines.

Reddi et al. [15] examined the use of power-efficient mobile cores in search engines. While the measured power consumption was lower, this affects the quality of search.

Ding et al. [8] studied using GPUs in the web search engine. First, it focuses on the index compression/decompression, list intersection, and simple scoring.
With the advancement of web search engines, relevance researchers are using more complicated scoring models for ranking documents. The paper mentions that it performs the selection only: "Thus, our approach can be seen as implementing the first phase, which aims to select promising candidates that the complete scoring function should be applied to. In contrast, the second phase has a very different structure and implementing it on a GPU would be an interesting and challenging problem for future work." Our work is different in that we target the first-stage fast ranking models of the "Selection Phase" as well as the bigger ranking models of the second-stage expensive "Ranking Phase".

Second, the data copy latency is not accounted for; they mention that it happens asynchronously while the GPU is busy with other work, but even so it affects the query latency. We analyze the ranking with and without the data copy, which is informative for on-chip GPUs like AMD's APUs.

Yan et al. [45] studied using FPGAs in the web search engine. While the estimated number of served documents D per query per second Q is good (DQ/Dollar = 9.75 compared to a baseline of 1.36), moving a big part of the work from the general-purpose CPU architecture to FPGAs is a big investment in new hardware, development, debugging, and testing, and the ability to maintain and change this code on a regular basis is questionable. With our proposal of using the GPU, on the other hand, we introduce only a change in the code base from C++ to C++ AMP (not a big change). Furthermore, it costs less per machine: around 300-400 dollars for the GPU instead of the claimed $4,410.

Putnam et al. [33] use FPGAs to accelerate Microsoft Bing's ranking; the paper reports 2x throughput with 50% less power consumption. This work affected the code development time and the code maintainability. Also, many optimizations were made to change floating-point operations to fixed point or to round to integers.
They did not apply the same optimizations to the CPU for a proper comparison. Also, this work uses 7-9 FPGAs in a chain of 7-9 connected servers to perform the ranking task (because the ranking logic cannot fit in one FPGA), which complicates the deployment and the fault tolerance story.

Recently, FPGA vendors started to support OpenCL. Rodriguez-Donate et al. [36] describe early experiences with OpenCL on FPGAs: while the optimized code performed well, it suffered from area inefficiency.

Recent work on GPUs in the data center by Hetherington et al. [38] shows that a data center workload such as Memcached can be 27% more cost efficient on the GPU than on the CPU.

Chapter 7

Conclusion

In this thesis, we examined the use of GPUs in the data center to accelerate a regular data center workload, a web search engine. Studying the web search engine's query processing stack in depth allowed us to find the candidate modules for acceleration. Our experiments reveal the potential of using GPUs to achieve vertical scaling with a limited power budget increase. We showed that the GPU can achieve lower latency than the CPU when using complex rankers and when ranking a large number of documents per query. The reduced ranking latency and increased throughput can be used in four scenarios:

• Adding a commodity GPU to each ISN in the current system, without an increase in the index size, allows processing more documents in each ISN per query. This counters the negative effect of the early termination that happens in some cases, by allowing more documents to be ranked within the current SLA. The documents are sorted in the index by importance, so including more documents in the selection phase does not always mean a better result.
But in some rare cases the later documents (documents at the end of the index) may be more relevant to the user's search query.

• Adding a GPU per ISN makes more sophisticated and complex ranking model types feasible for relevance researchers to use in the selection stage. This gives relevance researchers the ability to run more complex models in the selection phase instead of using only the fast first-pass rankers. Currently, they are restricted in the size and complexity of the ranking model that they can use in the selection phase, because ranking in that phase runs over a huge number of documents, and a small increase in ranking latency accumulates for each document; if that happens, the number of considered documents decreases significantly. Our experimentation shows that ranking on the GPU can be 19x faster than on the CPU, which allows new types of rankers to be used in the fast first-pass ranking phase. With deeper machine learning models comes better ranking, which leads to stronger ranking in the fast first-pass ranking phase. This provides a better-reduced set of documents to the second-stage expensive ranking phase, increasing the quality of the documents and their relevance.

• Allowing the ISNs to serve a larger index partition. The latency decrease from using GPUs allows the ISNs to consider more documents from the index.

• Reducing the number of machines in the data center per full index. This can be achieved by repartitioning the index into bigger chunks in the current system, assuming that I/O and caching are not the bottlenecks.

All the previous points allow for faster ranking on a larger index at a cost-efficient price per query, which leads to increased search quality.

7.1 Future Work

GPUs can be more efficient than the CPU for handling highly parallel, data-driven applications.
The following list contains more areas whose potential for GPU acceleration we would like to examine:

• Support more types of ranking models and more feature extraction.

• Examine the benefit of offloading more compute-intensive parts of the query processing, e.g., the index decompression. Although this seems like a good candidate, we cannot confirm that it is a compute-bound process because it involves I/O too.

• Consider accelerating parts of the top-level stack of the query processing, for example the aggregator that aggregates the scores from all the ISNs.

• Another interesting area is offloading parts of the index building steps to the GPU, especially given the recent work on using GPUs for network packet processing. It could also be interesting to examine building a crawler on the GPU.

Bibliography

[1] Adnan Abid, Naveed Hussain, Kamran Abid, Farooq Ahmad, Muhammad Shoaib Farooq, Uzma Farooq, Sher Afzal Khan, Yaser Daanial Khan, Muhammad Azhar Naeem, and Nabeel Sabir. A survey on search results diversification techniques. Neural Computing and Applications, pages 1–23.

[2] AMD Corp. AMD Accelerated Processing Units, 2012.

[3] AnandTech. Maxwell: Designed For Energy Efficiency, 2012.

[4] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proc. ACM Conf. on World Wide Web, pages 107–117, 1998.

[5] Irina Ceaparu, Jonathan Lazar, Katie Bessiere, John Robinson, and Ben Shneiderman. Determining causes and severity of end-user frustration. International Journal of Human-Computer Interaction, 17(3):333–356, 2004.

[6] Maurice De Kunder. The size of the world wide web. WorldWideWebSize, 2012.

[7] Electronic Educational Devices. Watts Up Pro, 2009.

[8] Shuai Ding, Jinru He, Hao Yan, and Torsten Suel. Using Graphics Processors for High Performance IR Query Processing. In Proc. ACM Conf. on World Wide Web, pages 421–430, 2009.

[9] Dennis F. Galletta, Raymond Henry, Scott McCoy, and Peter Polak. Web site delays: How tolerant are users?
Journal of the Association for Information Systems, 5(1):1, 2004.

[10] Boby George and Pooja Nagpal. Optimizing parallel applications using Concurrency Visualizer: A case study. Parallel Computing Platform Group, Microsoft Corporation, 2010.

[11] Kate Gregory and Ade Miller. C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++. Microsoft Press, 2012.

[12] Khronos OpenCL Working Group et al. The OpenCL specification. Version 1(29):8, 2008.

[13] Allan Heydon and Marc Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219–229, 1999.

[14] Facts Hunt. Total number of websites & size of the internet as of 2013, 2015.

[15] Vijay Janapa Reddi, Benjamin C. Lee, Trishul Chilimbi, and Kushagra Vaid. Web Search Using Mobile Cores: Quantifying and Mitigating the Price of Efficiency. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pages 314–325, 2010.

[16] Thienne Johnson, Carlos Acedo, S. G. Kobourov, and Sabrina Nusrat. Analyzing the evolution of the internet. In 17th IEEE Eurographics Conference on Visualization (EuroVis, short papers). Accepted, to appear, 2015.

[17] Jonathan Koomey. Growth in data center electricity use 2005 to 2010. A report by Analytical Press, completed at the request of The New York Times, page 9, 2011.

[18] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal of the ACM (JACM), 27(4):831–838, 1980.

[19] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. In Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pages 451–460, 2010.

[20] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to Information Retrieval, volume 1.
Cambridge University Press, Cambridge, 2008.

[21] Maged Michael, Jose E. Moreira, Doron Shiloach, and Robert W. Wisniewski. Scale-up x scale-out: A case study using Nutch/Lucene. In Parallel and Distributed Processing Symposium, 2007 (IPDPS 2007), pages 1–8. IEEE, 2007.

[22] Microsoft Corp. C++ AMP Overview.

[23] Microsoft Corp. Microsoft Windows High-Resolution Performance Timer API, 2012.

[24] Brian H. Murray and Alvin Moore. Sizing the internet. White paper, Cyveillance, page 3, 2000.

[25] Todd Mytkowicz, Madanlal Musuvathi, and Wolfram Schulte. Data-parallel finite-state machines. ACM SIGPLAN Notices, 49(4):529–542, 2014.

[26] Johan Nilsson. Timers: Implement a continuously updating, high-resolution time provider for Windows. MSDN Magazine, pages 78–88, 2004.

[27] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web, pages 83–92. ACM, 2006.

[28] CUDA Nvidia. Programming guide, 2008.

[29] NVIDIA Corp. NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110, 2012.

[30] NVIDIA Corp. NVIDIA PerfKit, 2013.

[31] Kyoung-Su Oh and Keechul Jung. GPU implementation of neural networks. Pattern Recognition, 37(6):1311–1314, 2004.

[32] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. 1999.

[33] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jordan Gray, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 13–24. IEEE, 2014.

[34] Feng Qiu, Zhenyu Liu, and Junghoo Cho. Analysis of user web traffic with a focus on search activities. In WebDB, pages 103–108. Citeseer, 2005.

[35] S. E. Robertson, S. Walker, S. Jones, M. M.
Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proc. Text Retrieval Conference (TREC), pages 109–126, 1994.

[36] C. Rodriguez-Donate, G. Botella, C. Garcia, E. Cabal-Yepez, and M. Prieto-Matias. Early experiences with OpenCL on FPGAs: Convolution case study. In Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on, pages 235–235. IEEE, 2015.

[37] Dillon Sharlet, Aaron Kunze, Stephen Junkins, and Deepti Joshi. Shevlin Park: Implementing C++ AMP with Clang/LLVM and OpenCL. In General Meeting of LLVM Developers and Users, 2012.

[38] Tayler H. Hetherington, Mike O'Connor, and Tor M. Aamodt. MemcachedGPU: Scaling-up scale-out key-value stores. 2015.

[39] TechPowerUp. TechPowerUp GPU-Z, 2012.

[40] WorldOMeters. Internet live stats.

[41] Yeming Xiao. Speeding up deep neural network based speech recognition systems. Journal of Software, 9(10):2706–2712, 2014.

[42] Anhao Xing, Xin Jin, Ta Li, Xuyang Wang, Jielin Pan, and Yonghong Yan. Speeding up deep neural networks for speech recognition on ARM Cortex-A series processors. In Natural Computation (ICNC), 2014 10th International Conference on, pages 123–127. IEEE, 2014.

[43] Song Xing and Bernd-Peter Paris. Measuring the size of the internet via importance sampling. Selected Areas in Communications, IEEE Journal on, 21(6):922–933, 2003.

[44] Hao Yan, Shuai Ding, and Torsten Suel. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web, pages 401–410. ACM, 2009.

[45] Jing Yan, Zhan-Xiang Zhao, Ning-Yi Xu, Xi Jin, Lin-Tao Zhang, and Feng-Hsiung Hsu. Efficient query processing for web search engine with FPGAs. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 97–100, 2012.

[46] Jiangong Zhang, Xiaohui Long, and Torsten Suel. Performance of compressed inverted list caching in search engines.
In Proceedings of the 17th International Conference on World Wide Web, pages 387–396. ACM, 2008.

[47] Zhijia Zhao and Xipeng Shen. On-the-fly principled speculation for FSM parallelization. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 619–630. ACM, 2015.

[48] Zhijia Zhao, Bo Wu, and Xipeng Shen. Challenging the embarrassingly sequential: Parallelizing finite state machine-based computations through principled speculation. In ACM SIGARCH Computer Architecture News, volume 42, pages 543–558. ACM, 2014.

[49] Justin Zobel and Alistair Moffat. Inverted Files for Text Search Engines. ACM Comput. Surv., 38(2), July 2006.

