Finding a Record in a DatabasebyBahare FatemiB.Sc., Amirkabir University of Technology, 2015A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)August 2017c© Bahare Fatemi 2017AbstractConsider the following problem: given a database of records indexed by names(e.g., of companies or restaurants) and a new name, determine whether the newname is in the database, and if so, which record it refers to. This problem is calledrecord linkage. Record linkage is a challenging problem because people do notconsistently use the official title of a company, but use abbreviations, synonyms,different orders of terms, and the title can contain typos. We provide a probabilisticmodel using relational logistic regression to find the probability of each record inthe database being the desired record for a given query, and find the best record(s).Our model addresses many of challenges of the record linkage problem and pro-vides good results when exact term matching search algorithms fail. We evaluateour model on a large real-world data set. Obtained results show that the model is apromising probabilistic record linkage model.iiLay SummaryHaving duplicates in a database is often a pain for companies for several reasons:• duplicates increase the storage space• duplicates may cause inconsistencies• by increasing the size of the database, duplicates increase the running timeof different tasks on the databaseFinding duplicates is challenging due to the fact that they usually do not looksimilar. Typos, spacing issues, abbreviations, and short forms are example sourcesof dissimilarities. We studied different methods for finding duplicates in a databaseindexed by names and proposed a novel method. Our method is capable of identi-fying duplicates in presence of several dissimilarity sources.iiiPrefaceThis thesis is largely based on collaborative work between UBC and TELUS. Icollaborated with Seyed Mehran Kazemi and Dr. David Poole on the UBC team,and Mike Tyndall, William Wong, and Eric Yuen on the TELUS team. Duringcompletion of this work, we had a great deal of active discussion among bothteams. Knowledge of machine learning algorithms and their implementations wereprovided by the UBC team, and data, as well as expert knowledge about the data,was provided by the TELUS team. Below is a description of specific contributionsof each member in this project.Dr. Poole, my academic supervisor and supervisor of this work, introduced thisresearch topic to me. We had weekly meetings discussing my progress, analyzingthe obtained results, and deciding on the next steps. He guided me through all thesteps, making sure I was on the right track.Mr. Kazemi, Ph.D. candidate working under Dr. Poole’s supervision, actedas an advisor for this work. I consulted with him on various aspects of the work,including the ideas I came up with, the results I obtained, and the problems I faced.He provided valuable feedback and helped me find solutions to several problems.We had weekly meetings with Mr. Wong and Mr. Yuen from TELUS team toinvestigate our progress, analyze the results and obtain feedback on the data. Mr.Kazemi actively participated in all these meetings.Mr. Tyndall, the manager of the TELUS team, introduced this problem to usand provided us with data, expert knowledge, and intuitions on why some ideasmay work or not work on their dataset. He participated in our weekly meetingswith TELUS team once in a month and provided valuable feedback.Mr. Wong and Mr. Yuen, our main source of contact for the TELUS team, metwith us every week. In these meetings, I presented the progress of the work andshared the results. We investigated the results and they provided helpful thoughtsand feedback. We also developed time-lines and made sure we were not behindschedule.The TELUS and UBC teams met each four months. I presented the progressand everyone verified the ideas, results and next steps.During this work, I realized the shortcomings of applying the current recordlinkage algorithms to a large dataset. I came up with various ideas of how toivPrefaceovercome the shortcomings in this thesis. I also implemented and tested the ideas.After preparing the aforementioned contents for the thesis, I designed the struc-ture and wrote the initial draft of the thesis. Dr. Poole and Mr. Kazemi proofreadthe thesis and provide feedback. The TELUS team also provided helpful feedbackon the draft.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.2 TFIDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . 72.4 Relational Logistic Regression . . . . . . . . . . . . . . . . . . . 72.5 Continuous PRVs . . . . . . . . . . . . . . . . . . . . . . . . . . 93 A Model of Record Linkage . . . . . . . . . . . . . . . . . . . . . . 113.1 A probabilistic TFIDF based model . . . . . . . . . . . . . . . . 113.2 Adding translations . . . . . . . . . . . . . . . . . . . . . . . . . 133.3 Learning Tr(t, t′) . . . . . . . . . . . . . . . . . . . . . . . . . . 153.4 Adding bigrams . . . . . . . . . . . . . . . . . . . . . . . . . . 163.4.1 Query contains the bigram . . . . . . . . . . . . . . . . . 163.4.2 Document contains the bigram . . . . . . . . . . . . . . 163.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 17viTable of Contents4 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.4 Automating the process of searching for a query . . . . . . . . . 214.5 Thresholding IDF scores . . . . . . . . . . . . . . . . . . . . . . 234.6 Extending the model for documents with more than one field . . . 265 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28viiList of Tables4.1 Some pairs of terms learned by our heuristic for Tr(t, t′): each twoterms in a columns shows a learned pair by our heuristic . . . . . 19viiiList of Figures2.1 An RLR model taken from [28]. . . . . . . . . . . . . . . . . . . 92.2 An RLR model for Result(q, d) . . . . . . . . . . . . . . . . . . . 104.1 hit@k of TFIDF, TFIDF+TR, and TFIDF+TR+BG. . . . . . . . . 204.2 Set differences for the hit@k of (a) TFIDF (C1) and TFIDF+TR(C2), (b) TFIDF and TFIDF+TR+BG (C3), and (c) TFIDF+TRand TFIDF+TR+BG. . . . . . . . . . . . . . . . . . . . . . . . . 224.3 hit@1 of the three methods TFIDF, TFIDF+TR, and TFIDF+TR+BGfor different trust thresholds tt. . . . . . . . . . . . . . . . . . . . 234.4 automation percentage given the trust threshold tt for the threemethods TFIDF, TFIDF+TR, and TFIDF+TR+BG. . . . . . . . . 244.5 hit@k of TFIDF model with different cut-off thresholds of IDFscore of words for different values of k. . . . . . . . . . . . . . . 254.6 Average time to search a query in TFIDF with different IDF thresh-olds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254.7 hit@k of TFIDF+TR and TFIDF+TR+PC. . . . . . . . . . . . . . 26ixAcknowledgementsI would like to express my sincerest gratitude to my supervisor and mentor, Pro-fessor David Poole, who has been always supportive and patient with me. Hisguidance and support in academia and life have been truly invaluable.My deepest acknowledgment goes to TELUS team members, Mr. Todd Hoskins,Mr. Mike Tyndall, Mr. William Wong, and Mr. Eric Yuen for the funding they pro-vided for the project and all the support and encouragement they gave me. Withouttheir guidance, constant feedbacks, and valuable ideas, this thesis would not havebeen achievable.My deep and sincere gratitude to my family for their continuous and unpar-alleled love, help and support. I am grateful to my brother, Hossein, for alwaysbeing there for me as a friend. I am forever indebted to my parents, Mahmoud andEsmat, for giving me the opportunities and experiences that have made me who Iam. They selflessly encouraged me to explore new directions in life and seek myown destiny. This journey would not have been possible if not for them.Finally, and above all, I cannot begin to express my unfailing gratitude andlove to my fiance´ and colleague, Mehran, who has supported me throughout thisprocess and has constantly encouraged me when the tasks seemed arduous andinsurmountable.xDedicationThis thesis work is dedicated to my fiance´, Mehran, who has been a constant sourceof support and encouragement during the challenges of graduate school and life.I am truly thankful for having you in my life. This work is also dedicated to myparents, Mahmoud and Esmat, who have always loved me unconditionally andwhose good examples have taught me to work hard for the things that I aspire toachieve.xiChapter 1IntroductionNowadays, many companies offer services that require searching their database fora query text specified by a user. A website containing reviews for restaurants, forinstance, lets its users find their desired restaurant through searching the restaurantname. A website containing scientific papers lets its users find their desired paperthrough searching the title of the paper. A telephone company needs to searchthrough their customer records (individual names or company titles) for customerinquiries. This problem is called record linkage [16].The main challenges in record linkage include:• abbreviating all or part of the name while the database contains the full name(e.g., searching for uncertainty in AI while the database contains uncertaintyin artificial intelligence) or vice versa,• changing the order of the terms in the name (e.g., relational probabilisticmodels instead of probabilistic relational models),• entering only some (not all) terms in the name (e.g., graphical models insteadof probabilistic graphical models),• shortening a long term or person’s name (e.g., Bayes net instead of Bayesiannetworks or Mike instead of Michael),• spacing issues (e.g., searching for drop out while the database contains dropout),• terms that have more than one spellings (e.g., neighbor and neighbour),• typos.Record linkage is the task of recognizing records in two separate files whichrepresent identical persons, objects, or events. In order to complete this task acomparison should be made between the characteristics of two records (one fromeach file) and a decision should be made of whether these two records in the com-parison are referring to the same object or not [16, 46]. It has been previouslystudied independently by researchers in several areas by various names. As ob-ject identification [45], entity resolution [4], identity uncertainty [37], approximate1Chapter 1. Introductionmatching [21], duplicate detection [34], merge/purge [22], or hardening soft infor-mation [11]. Examples of record linkage include citation matching [19], personidentification in different Census datasets [46], and identifying different offers ofthe same product from multiple retailers for comparison shopping [5].Record linkage is similar to the short-text web search problem studied in theliterature [24, 25, 30, 43], where users search through documents containing shorttexts (e.g., considering only document titles). Short text search determines seman-tic similarity between two short texts and finds out if two pieces of short texts meanthe same thing. This can be used in many different areas in information retrievallike search [31], query suggestion [32], automatic summarization [1] and imagefinding [9]. The differentiating criterion, however, is that unlike record linkage, inshort-text search a query is to be matched to a document with the same meaning.For instance, if a query for a short-text search contains the term passion, the en-gine may score a document having the term passion and a document having theterm love (almost) the same. In record linkage, however, if the user searches for acompany name containing the term passion, it may not be sensible to score com-panies having the term passion and those having love the same. Furthermore, thelength of the query for short-text search is usually shorter than the length of thecorresponding document, whereas in record linkage there are no syntactic differ-ences between queries and record. Nevertheless, many techniques developed forshort-text search can be used for record linkage.The classic approaches for tackling a text search problem are variants of exactterm matching, i.e. matching exact terms in the query with those in records (ordocuments), weighting each term according to its frequency. These approaches area bag-of-words retrieval functions that ranks a set of documents based on the queryterm appearing in each document. Well-known variants of exact term matching in-clude TFIDF [42] and Okapi BM25 [40]. Different matching exact term algorithmsuse different functions for scoring documents. Due to the challenges mentionedearlier for record linkage and considering these algorithms only score documentsfor having an exact term of the query, these approaches fail on a portion of queries.Distance functions are also used in approximate matching of name and records.A distance function outputs a real number r for a pair of word s and t, wheresmaller r indicates greater similarity between s and t. SecondString Toolkit isan open-source Java toolkit focusing on distance functions [10] and supports alarge number of distance functions like Levenstein distance, which assigns a costto each edit operation, Monge-Elkan distance [34], Smith-Waterman distance func-tion [44], and Jaro metric [27] [26], which is widely used in record linkage area.Distance functions can also be used for record linkage. Using distance functions inrecord linkage does not work in many examples when the query search and the de-sired document appearance are different (e.g. searching for “UAI” while there are2Chapter 1. Introductiontwo documents “UBC” and “uncertainty in artificial intelligence”, distance func-tions like Levenstein distance, Monge-Elkan distance, and Jaro metric give “UBC”more score than “uncertainty in artificial intelligence” but “UAI” is abbreviation of“uncertainty in artificial intelligence”).Latent semantic models [12][23][6][14][38] aim at improving the exact-termmatching approach by converting the query and the document to a smaller spacecalled the semantic space and finding the similarities in the semantic space. So aquery and a document can have high similarity score even if they do not have anycommon words. One well-know latent semantic model in information retrieval islatent semantic analysis [12]. In this model documents and queries are mapped toa low dimensional semantic vector Dˆ = ATD, where A is projection matrix, usingsingular value decomposition (SVD) [20]. Terms that are mapped to the samedimension are assumed to have similar semantics. In document search, relevanceof a document for a query is computed as the cosine similarity score of dimensionalsemantic vector of document and query according to A, Dˆ and Qˆ:simA(Q,D) =QˆTDˆ∥∥Qˆ∥∥∥∥Dˆ∥∥ (1.1)The conversion aims at mapping terms with similar semantics to the same partof the semantic space. Popular models are grouped into two groups of linear pro-jection and generative topic models.Latent semantic models can be categorized into two groups based on their useof available data: supervised and unsupervised models. Since the evaluation metricfor the retrieval task (usually precision and recall) is in the labels and most exist-ing latent semantic models [12, 14, 23, 6, 41] are learned unsupervised, they oftenlearn models that are only loosely coupled to the evaluation metric for the retrievaltask, but supervised latent semantic models learn the semantical similarities be-tween terms (or between queries and documents) that are directly aligned with theevaluation metric of the retrieval task using a training set. The training set may beconstructed by having an expert manually selecting the appropriate document, orcome from a user clicking a document after searching for a query.In addition to latent semantic models, the translational approaches [17, 18] anddeep learning approaches [25, 36] use a supervised approach. Translation modelslearns a one word to one word translation between query words and documentswords. These models are trained using labeled datasets. Studies have shown thatusing large amount of labeled datasets, translation models can be very effective[17][18]. Deep learning approaches aim at mapping the query and the documents toa smaller space where the distance between the query and corresponding documentis minimized and distances between the query and other documents are maximized.3Chapter 1. IntroductionIn this paper, we take the supervised translational approach for the record link-age problem and design a probabilistic model which gives the probability of eachrecord in the database being the queried record. In many applications, a proba-bilistic approach may facilitate the decision making process for the owner of thesystem in terms of specifying an error tolerance and automating a portion of thequeries. We chose our probabilistic model to be a relational logistic regression(RLR) model [28] as it simplifies specifying and describing our model, can be ex-tended when the dataset contains more fields (See section 4.6), and empiricallycompares well to other similar models according to a recent comparative study[29]. The proposed method is a sub-linear record linkage approach because foreach query we only visit the documents that we believe have a chance to be thedesired document. We conducted experiments on a large real-world dataset from atelecommunications company.4Chapter 2BackgroundIn this chapter, we provide sufficient information for readers to read the rest of thethesis. We also describe the terminologies, algorithms, and models used in the restof the paper.2.1 TerminologyTo be consistent with other scientific papers on this field like Huang et al., we usethe following terminologies.A term is a sequence of letters and digits. A document is a sequence of terms.A corpus is a set of documents and corresponds to a database of records in recordlinkage. A query is also a document that we are interested in finding duplicateof. A positive labeled set is a set of 〈query,document〉 pairs where the documentis the duplicate of the query in the corpus. A negative labeled set is a set of〈query,document〉 pairs where the document is not the duplicate of the query.Example 1. “uncertainty”, “in”, “artificial”, and “intelligent” are instances ofterm. “AAAI”, “uncertainty in AI”, and “uncertainty in the world” are instancesof document. “UAI” and “association of advancement of artificial intelligence”may be instances of query. Set of {〈 “UAI”, “uncertainty in AI” 〉, 〈 “associationof advancement of artificial intelligence”, “AAAI” 〉} may be a positive labeledset. Set of {〈 “UAI”, “uncertainty in the world” 〉, 〈 “association of advancementof artificial intelligence”, “uncertainty in AI” 〉} may be a negative labeled set.A document corresponds to a record in record linkage, except a record mayhave several fields but a document contains one field which is its text. Out ofdifferent fields the most challenging ones in record linkage are text fields that needto be parsed and understood. For simplicity we will consider records with one textfield for now. We will extend our model to consider records with more than onefield in section 4.6.52.2. TFIDF2.2 TFIDFIn information retrieval, TFIDF [42] is an exact term matching search algorithmwhich intends to reflect the importance of terms to a document in a corpus. Nowa-days, TFIDF is one of the most popular term-weighting schemes. For instance,[3] found that 83% of text-based recommender systems in the domain of digitallibraries use TFIDF.Given a query Q, TFIDF score for each document D being the desired docu-ment is as follows:TFIDF(D,Q) = ΣT∈Q∧DTF(T,D)∗ IDF(T) (2.1)where:• TF(T, D): stands for the term frequency and measures how frequent term Tis in document D. TF(T,D) is usually computed by counting the number oftimes T appears in D.• IDF(T): stands for inverse of document frequency and measures how muchinformation the term provides, that is, whether the term is common or rareacross all documents in the corpus. IDF aims at scaling down the importanceof common terms and scaling up the importance of rare terms. Typically,the IDF score of a term T for a corpus is computed as log nDF(T) , where nis the number of all documents in the corpus and DF(T) is the number ofdocuments that have term T .Certain terms such as is, of, and are tend to have high TF scores because theyusually appear multiple times, but have little importance. So the reason to considerIDF score is to penalize such frequent terms.A high TF-IDF score is reached by a high term frequency and low documentfrequency of the term in the corpus. The IDF score tends to filter out commonterms. Since the ratio inside the logarithm function is always greater than or equalto 1, the value of IDF (and also TFIDF) is greater than or equal to 0. As a termappears in more documents, the ratio inside the logarithm approaches 1, bringingthe IDF and TFIDF score closer to 0.For record linkage, the TF part of the TFIDF is usually 1 for the terms in thedocument and can be ignored (e.g., rarely we see name of a company, person, orpaper having one word twice). Therefore, only the IDF of the terms that are incommon between a query and a document contribute to the score of the documentin record linkage.TFIDF only gives scores to the documents for having the exact terms as inquery. In many cases, however, the query may contain different terms than the62.3. Logistic Regressiondesired document due to abbreviations, term shortenings, using equivalent terms,typos, etc. In such cases, TFIDF often fails to produce good results.2.3 Logistic RegressionLogistic regression (LR) [2] is a popular classification method within machinelearning community. We describe how it can be used for classification following[8] and [33].Suppose we have a set of labeled examples {(x1,y1),(x2,y2), ...,(xm,ym)}, whereeach xi is composed of n features with values xi1,xi2, ...,xin and each yi is a binaryvariable. Logistic regression learns a set w = {w0,w1, ...,wn} of weights, wherew0 is the intercept and wj is the weight of the i-th feature of xi, xij. For simplic-ity we assume a new dimension xi0 has been added to the data to avoid treatingw0 differently than other wjs. LR defines the probability of yi being 1 given xi asfollows:P(yi = 1|xi,w) = σ(Σnj=0xijwj) (2.2)where σ(x) = 11+exp(−x) is the Sigmoid function.Logistic regression learns the weights by maximizing the log-likelihood of thedata (or equivalently, minimizing the logistic loss function) as follows:wLR = argmaxwΣmi=0log(P(yi = 1|xi)) (2.3)An L1 or L2-regularization can be added to the loss function. The L1-regularizationencourages automatic feature selection.2.4 Relational Logistic RegressionRelational logistic regression (RLR) [28] is the relational counterpart of logisticregression (LR), and the directed counterpart of Markov logic networks [13]. Firstwe start with some terminologies:A population refers to a set of objects. The size of a population is a non-negative number indicating its cardinality.A Logical variable (logvar) starts with lower-case letters, and constant de-noting objects starts with upper-case letters. Associated with a logical variable xis a population pop(x). A lower-case and an upper-case letter written in bold referto a set of logvars and a set of individuals respectively. We write a substitution asθ = 〈x1, ...,xk〉/〈t1, ..., tk〉 where each xi is a different logvar and each ti is a logvaror a constant in pop(xi).72.4. Relational Logistic RegressionA parametrized random variable (PRV) is of the form F(t1, ..., tk) where Fis a k-ary function symbol and each ti is a logvar or a constant. A grounding ofa PRV can be obtained by a substitution θ = 〈x1, ...,xk〉/〈X1, ...,Xk〉 mapping eachof its logvars xi to an object in pop(xi)A literal is a PRV or its negation. A formula is made up of literals connectedwith conjunction or disjunction. η(F,Π) is defined as the number of times formulaF is true according to Π. A weighted formula (WF) is a tuple 〈F,w〉 where F is aformula and w is a weight. Applying a substitution θ = 〈x1, ...,xk〉/〈t1, ..., tk〉 on aformula F (written as Fθ ) replaces each xi in F with ti.Let H(x) be a PRV whose values are to be predicted based on a set φ of PRVsnot including H. We call φ as the parents of H. Relational logistic regressiondefines a conditional probability distribution for H(x) given its parents φ using aset ψ of WFs only containing PRVs from φ as follows:Pψ(H(X) = True|Π) = σ( ∑〈F,w〉∈ψw∗η(Fθ ,Π)) (2.4)where Π is an assignment to all ground PRVs in parents of H. η(Fθ ,Π) is thenumber of instances of Fθ that are true w.r.t. Π. σ(z) = 11+exp(−z) is the Sigmoidfunction.Example 2. Consider we want to find probability of a person being happy and weknow that happiness has a relation with number of kind friends the person has suchthat the more kind friends the person has the happier he/she is. The model in Fig.2.1 shows our theory. In this model let Π be an assignment of values to Friend andKind. RLR sums over ψ = {WF1,WF2} resulting in:Pψ(Happy(X) = True |Π) = σ(−4.5+1∗η(Friend(y,X)∧Kind(y),Π)) (2.5)where η(Friend(y,X)∧Kind(y),Π) represents the number of objects in pop(y) forwhich Friend(y,X)∧Kind(y) is true according to Π, corresponding to the numberof friends of X that are kind. When this count is greater than or equal to 5, theprobability of X being happy is closer to one than zero; otherwise, the probabilityis closer to zero than one. Therefore, the two WFs model "someone is happy ifthey have at least 5 friends that are kind". Note that -4.5 and 1 are weights of twoformula that are learned.Example 3. Consider we want to find probability of a document being the resultof searching for a query. The relational model in Fig. 2.2 shows that probability ofa document being a result for a query depends on number of important words theyboth have. So we can use the following set ψ of WFs for defining the conditionalprobability of Result(q,d) given its parents:82.5. Continuous PRVs Friend(y,x) Kind(y) Happy(x) 𝑊𝐹1: < 𝑇𝑟𝑢𝑒,−4.5 > 𝑊𝐹2:< 𝐹𝑟𝑖𝑒𝑛𝑑(𝑦, 𝑥) ∧ 𝐾𝑖𝑛𝑑(𝑦), 1 > Figure 2.1: An RLR model taken from [28].〈True,−4〉 〈Has(d, t)∧Has(q, t)∧ Important(t),1.5〉where Has(d, t) is true if term t is in document d, and Important(t) is true if t isimportant.Let Π be an assignment of values to Has and Important. RLR sums over theWFs in ψ resulting in:Pψ(Result(Q,D) = True |Π) =σ(−4.0+1.5∗η(Has(D, t)∧Has(Q, t)∧ Important(t),Π))(2.6)where η(Has(D, t)∧Has(Q, t)∧ Important(t),Π) represents the number of objectsin pop(t) for which: Has(D, t)∧Has(Q, t)∧ Important(t) is true according to Π,corresponding to the number of terms that are both in Q and D and are important.When this count is greater than or equal to 3, the probability of D being the resultof Q is closer to one than zero; otherwise, the probability is closer to zero than one.Therefore, the two WFs model "a document is a result of a query if they share atleast 3 important terms".2.5 Continuous PRVsDuring learning, Has may get Boolean values and Important may get real values.In order to incorporate real values in the RLR model, we can add continuous unaryPRVs. In this case Important does not need to reflect a Boolean value of whethera word is important or not but can reflect a number as a measure of importance ofthe world.Following [28], w.l.o.g. we assume WFs have no disjunctions (we can rep-resent any weighted formula with disjunction in another format without disjunc-tion as well: 〈A(x)∨B(x),w〉 can be represented as 〈A(x),w〉,〈¬A(x)∧B(x),w〉).We adopt the extension proposed in [15] for handling continuous PRVs in RLR.We replace True with 1, False with 0, and ∧ with *. This allows continuous92.5. Continuous PRVs Has(d, t) Has(q, t) Important(t) Result(q, d) Figure 2.2: An RLR model for Result(q, d)PRVs in WFs. For example if we have A(X) = False, B(X) = True, and C(X) =1.2, then 〈a(X) ∗ b(X),w〉 evaluates to 0, 〈¬a(X) ∗ b(X),w〉 evaluates to w, and〈¬a(X)∗b(X)∗ c(X),w〉 evaluates to 1.2 * w.10Chapter 3A Model of Record LinkageIn this chapter we provide information on how we develop our record linkagemodel, what problem of record linkage each part of the model solves, and howall different parts communicate to each other.Let t, q, and d be three logvars corresponding respectively to the terms, queries,and documents, Has(d, t) be a Boolean PRV indicating whether document d has theterm t or not and is observed for all documents and terms, IDF(t) be an observedreal valued PRV indicating the IDF score for terms, and Result(q,d) is True whend is result of q.3.1 A probabilistic TFIDF based modelWe design an RLR model which uses TFIDF to assign probabilities to each docu-ment being the desired document for a query. A possible RLR model may definethe conditional probability of Result(d,q) using the following WFs:〈True,w0〉〈Has(q, t)∗Has(d, t)∗ IDF(t),w1〉When both instances of “Has” are true, it contributes the weight IDF(t)∗w1.RLR sums over the above WFs resulting in:P(Result(Q,D) |Π) = σ(w0+w1 ∗ ∑T∈Q∧DIDF(T)) (3.1)where Q and D are respectively a specific query and document. Having a positivelabeled set and a negative labeled set, the weights w0 and w1 can be learned usinggradient descent.Note that −IDF(T) = −log nDF(T) = log DF(T)n can be viewed as the log of theprobability of term T appearing in a random document. Taking the Sigmoid of thesum of these probabilities (which sigmoid is a part of RLR and sum of probabilitiesis a part of TFIDF) is sensible for calculating P(Result(Q,D)):∑T∈Q∧DIDF(T) = ∑T∈Q∧DlognDF(T)(3.2)113.1. A probabilistic TFIDF based model− ∑T∈Q∧DIDF(T) = ∑T∈Q∧DlogDF(T)n= logΠT∈Q∧DnDF(T)(3.3)σ(w0+w1 ∗ ∑T∈Q∧DIDF(T)) =11+ exp−w0−w1∗∑T∈Q∧D IDF(T)(3.4)σ(w0+w1 ∗ ∑T∈Q∧DIDF(T)) =11+ exp−w0 ∗ exp−w1∗∑T∈Q∧D IDF(T) (3.5)Using equation 3.3 in equation 3.5 and exp−w0 = c we get:σ(w0+w1 ∗ ∑T∈Q∧DIDF(T)) =11+ c∗ explog(ΠT∈Q∧D DF(T)n )w1(3.6)We can simplify equation 3.6 as following:σ(w0+w1 ∗ ∑T∈Q∧DIDF(T)) =11+ c∗ (ΠT∈Q∧D DF(T)n )w1(3.7)Considering DF(T)n as probability of the term T appearing in a document,ΠT∈Q∧D DF(T)n can be viewed as probability of terms T ∈ Q∧D appearing in adocument assuming all terms occur independent of each other. (ΠT∈Q∧D DF(T)n )w1also is the probability of terms T ∈ Q∧D appearing in a document without theindependency assumption.One issue with the above model is that in the process of learning the weightsw0 and w1 for a query Q containing only few words (Q∧D set will be small andcauses a few IDF scores to sum up), or containing only common words (commonwords have a small IDF score), ∑T∈Q∧D IDF(T) may be much lower than mostof the cases. This problem causes ∑T∈Q∧D IDF(T) not to be high enough evenfor the query Q and its actual desired document D. This issue causes the weightw1 to get a higher value to output a high probability for these cases in learningP(Result(Q,D)). To address this issue we update the WFs to normalize the sum ofthe IDF scores by dividing it by the maximum IDF score a document can get forthis query, which is:SumIDF(Q) = ∑T∈Q IDF(T)Note that the document that can get the maximum score for query Q has allterms of Q. So we build our RLR model with the following two WFs:123.2. Adding translations〈True,w0〉〈Has(q, t)∗Has(d, t)∗ IDF(t)∗ InvSumIDF(q),w1〉 (3.8)where InvSumIDF is a PRV whose value for a query Q is the inverse of the sumof the IDF scores of the terms in query:InvSumIDF(Q) =1∑T∈Q IDF(T)Having the above WFs, IDF scores will be normalized for all query and docu-ment pairs.Example 4. Consider a query Q = [T1,T2,T3,T4], the document that can get themaximum score for Q has all terms T1, T2, T3, and T4. SumIDF(Q) in this queryis:IDF(T1)+ IDF(T2)+ IDF(T3)+ IDF(T4)and InvSumIDF(Q) is as:1IDF(T1)+IDF(T2)+IDF(T3)+IDF(T4)Example 5. Consider a query Q= [T1,T2,T3,T4] and a document D= [T1,T2,T5,T6].Then:P(Result(Q,D)) = σ(w0+w1∗IDF(T1)+ IDF(T2)IDF(T1)+ IDF(T2)+ IDF(T3)+ IDF(T4))(3.9)In which IDF(T1)+ IDF(T2) is the sum of the IDF score that D gets for query Qand IDF(T1)+ IDF(T2)+ IDF(T3)+ IDF(T4) is the maximum IDF score a docu-ment can get for query Q and normalizes the IDF(T1)+ IDF(T2).3.2 Adding translationsAs mentioned earlier, a TFIDF based method may fail on documents that use dif-ferent terms than those in the query. Let the PRV Tr(t, t′), where t and t′ are twologvars with population of the set of all terms, represent the probability of a termT ∈ pop(t) be a translation (or a part of a translation) of term T ′ ∈ pop(t′). We willexplain how such a PRV can be constructed and learned using positive and negativelabeled sets in later sections. We assume translation relation is symmetric (if t is atranslation or a part of a translation of t′, then t′ is also a translation or a part of atranslation of t) so Tr(t, t′) is symmetric too (Tr(t, t′) = Tr(t′, t)). Having this PRV,we can add the following WF to our model:133.2. Adding translations〈Has(q, t)∗Has(d, t′)∗¬Has(q, t′)∗¬Has(d, t)∗Tr(t, t′)∗ IDF(t)∗InvSumIDF(q),w1〉This WF refers to pairs of terms 〈T,T ′〉 such that T ∈ Q−D and T ′ ∈ D−Q andT ′ is a part of translation for T with probability Tr(T,T ′), and gives Tr(T,T ′) ∗IDF(T) score to the document. Note that the reason why we consider the Ts inQ that are not in D (and similarly for T ′) is that if T is also in D, then we havealready given D the IDF(T) score. This WF is only adding to the IDF score of thedocument for having non-exact terms that their translations are in the query and hasthe same scale as the second WF. So we give it the same weight w1, and normalizeit similarly.Example 6. In Example 5, with the same query Q and document D supposeTr(T3,T5) = 0.9. Then the numerator of fraction in Eq. 3.9 will be summed with0.9 ∗ IDF(T3). Note that if Tr(T1,T6) = 0.8, we do not add 0.8 ∗ IDF(T1) to thescore as the document contains T1 as well and we have already given IDF(T1)score to the document.For some Ti ∈ Q−D and Tj ∈ D−Q we have Tr(Ti,Tj) > 0, Ti’s translationmay contain multiple words and Tj may be only one term in the translation ofTi. As an example if Q is “UAI”, D is “uncertainty in artificial intelligence” andTr(“UAI′′,“uncertainty′′)> 0, “uncertainty” is only one term of the translation for“UAI”; the other terms are “in”, “artificial”, and “intelligence”.Example 7. Suppose the query is “UAI”, and there are two documents d1 = “UAI”and d2 = “uncertainty in AI” in the corpus and we have Tr(“UAI”, “uncertainty”) =0.6 and Tr(“UAI”, “AI”) = 0.7. Although d1 is exactly as the query, d2 gets morescore than d1. d2 getting more score than d1 is not sensible because the maximumscore should belongs to d1 which is exactly as the query.The IDF formulation makes the strong assumption that each term appears ina document independently of the other terms [39]. Therefore, our current WFswill give scores to D for all its terms independently. This is, however, not intuitiveas the terms in D in such cases are highly dependent: a document containing theterms “uncertainty”, “in”, and “artificial” is much more likely to have the term“intelligence” than a random document.To address this issue, when we learn the values of the Tr PRV, we also computefor each term T the MaxTr(T) as maxD∈d∑T ′∈D 1Tr(T,T ′)>0 where 1Tr(T,T ′)>0 is 1 ifTr(T,T ′)> 0 and 0 otherwise. This number corresponds to the maximum numberof terms T ′ in a document for which Tr(T,T ′) > 0. Assuming InvMaxTr(t) is aPRV whose value for each T ∈ t is 1MaxTr(T) , we update the WF we added to the143.3. Learning Tr(t, t′)model in this subsection as follows:〈Has(q, t)∗Has(d, t′)∗¬Has(q, t′)∗¬Has(d, t)∗Tr(t, t′)∗ InvMaxTr(t)∗IDF(t)∗ InvSumIDF(q),w1〉(3.10)Example 8. In a large dataset we may get: MaxTr(“UAI”) = 2 (for uncertaintyand AI), MaxTr(“association”) = 1 (for assoc), and MaxTr(“nips”) = 4 (for neural,information, processing, and systems).Example 9. In Example 5 with the same query and document, suppose:Tr(T3,T5) = 0.9, Tr(T3,T6) = 0.8, and InvMaxTr(T3) = 13 .Then the numerator of fraction in Eq. 3.9 will be summed with 0.93 ∗ IDF(T3)+0.83 ∗ IDF(T3). The denominator is same as before because it should represent themaximum score a document can get for this query for normalization and remainunchanged with this extra information about translation pairs.3.3 Learning Tr(t, t′)We learn and populate the probabilities for Tr for each pair 〈T,T ′〉 of terms usinga heuristic which automatically assigns zero probabilities to many pairs. To defineour heuristic, we first need to provide some intuition.Let PLS be a positive labeled set and 〈Q,D〉 be a query, document pair in PSL.For two terms T and T ′, if T ∈Q−D, T ′ ∈D−Q, or T ′ ∈Q−D, T ∈D−Q then Tmight be a term in the translation of T ′ or vice versa. Therefore, such occurrencesare positive events regarding Tr(T,T ′). Now consider the case where T ∈Q, T ∈Dand T ′ ∈D. This case reduces the possibility of T ′ being a term in the translation ofT as T also appears in D. Therefore, such occurrences are negative events regardingTr(T,T ′). A natural way to compute Tr(T,T ′) is by dividing the number of positiveevents by the sum of the number of positive and negative events.Let:• Match(T,T ′)=∑〈Q,D〉∈PLS Has(Q,T)∗Has(D,T ′)∗¬Has(Q,T ′)∗¬Has(D,T)• Seen(T,T ′) = ∑〈Q,D〉∈PLS Has(Q,T)∗Has(D,T ′).Then we let Tr(T,T ′)= Match(T,T′)+c1Seen(T,T ′)+c , where c1 and c are pseudo-counts. Pseudo-counts are the priors for beta distribution. These are learned by cross-validation.By adding pseudo-counts, we impose a prior that a pair of terms are less likely tobe part of the translation of each other, unless we see them match multiple times.The pseudo-counts allow small amounts of data to have some influence, but notmuch, whereas large amounts of data can overwhelm the prior.We learned Tr on a positive labeled set but there could be information that helpsin learning Tr on a negative labeled set too that we have not investigated.153.4. Adding bigrams3.4 Adding bigramsOne of the important challenges in record linkage is the spacing issue: the querymay contain a space between two words where the document does not (e.g., search-ing for “drop out” where the document contains “dropout”) or vice versa. In orderto handle such cases, we use the bigrams of the query and documents where abigram is a concatenation of two consecutive terms in the query or documents.There are two cases that need to be considered: 1- the query contains a bigramof the desired document, 2- the desired document contains a bigram of the actualquery. We describe how to handle each case separately.3.4.1 Query contains the bigramConsider a document D= [T1,T2, . . . ,Tn]. If a query contained the term concat(T1,T2),then T1 and T2 are parts of the translation of concat(T1,T2). As an off-line processwhich will be done only once, before the queries arrive, we go through all docu-ments and for each document D= [T1,T2, . . . ,Tn] we set Tr(concat(Ti,Ti+1),Ti) =1 and Tr(concat(Ti,Ti+1),Ti+1) = 1 for i : 1 ≤ i ≤ n−1 and also for the reverses.This allows us to recognize in a document the two elements of the bigram appear-ing in the query.Example 10. Suppose in the corpus C there is a document D as “drop out”. Inthe process explained in 3.4.1 we set Tr(dropout,drop) = 1, Tr(dropout,out) = 1,Tr(drop,dropout) = 1, and Tr(out,dropout) = 1. So if we search for a query Q= “dropout”, the translation pairs helps us to find the components of the bigram“dropout”, which are “drop” and “out”.3.4.2 Document contains the bigramSuppose there is a document D and suppose Ti, Tj are two consecutive terms in thequery Q, and neither of Ti and Tj appear in the document. Then we should also lookfor the concatenation of these two terms, concat(Ti,Tj), in the document. In orderfor our WFs to remain unaffected, we assume D also has Ti and Tj: Has(D,Ti) =true and Has(D,Tj) = True.Example 11. Suppose we want to search for Q = “drop out”. The document D =“dropout” will get the score of a document that has both terms drop and out forhaving the bigram dropout.Example 12. Suppose in the corpus C we have documents D1 = “drop out” andD2 = “dropout”. We want to search for Q1 = “drop out” and Q2 = “dropout”, while163.5. Implementationwe know InvMaxTr(dropout) = 0.5:P(Result(Q1,D1)) =σ(w0+w1 ∗ IDF(drop)+ IDF(out)IDF(drop)+ IDF(out))=σ(w0+w1)(3.11)P(Result(Q2,D1)) =σ(w0+w1 ∗ IDF(dropout)∗0.5+ IDF(dropout)∗0.5IDF(dropout) )=σ(w0+w1)(3.12)P(Result(Q1,D2)) =σ(w0+w1 ∗ IDF(drop)+ IDF(out)IDF(drop)+ IDF(out))=σ(w0+w1)(3.13)P(Result(Q2,D2)) =σ(w0+w1 ∗ IDF(dropout)IDF(dropout))=σ(w0+w1)(3.14)3.5 ImplementationIn order to implement our model efficiently, we use the following data structures.For the 3.8 WF, given a query Q, we need to score documents D that have at leasta term in common with Q. We use an inverted index which is a hash map fromterms to the documents that contain that term. Then for each T ∈Q, we can accessthe documents having T easily and update their score.For the 3.10 WF, the Tr matrix can be very large, and so needs to be imple-mented efficiently. We store a TrSet hash-map from terms to the set of terms thatmay be in their translation together with the corresponding probabilities. Then foreach term T ∈Q, we retrieve all documents not containing T but containing a termT ′ ∈ TrSet(T) and update their score according to the probabilities.17Chapter 4Empirical Results4.1 DatasetWe tested our model on a real dataset. This dataset contains a list of businessnames of the customers of a telecommunications company. The telecommunica-tions company offers a service which requires searching a business name enteredby a customer in the dataset. In this application, the list of customer names isthe corpus, each stored customer name is a document, and any searched customername is a query.The dataset contains approximately 650000 customer names (or documents).Each month 4K different customers (queries) are searched. Currently, except forsome obvious cases, the final document for a query is found or endorsed by anexpert, providing a large positive labeled set with approximately 1.6 million pairs.We used the pairs in our positive labeled set corresponding to queries until a certainpoint in time as training data and the queries in the next two months (almost 8Kqueries) as test data. This makes the task an extrapolation task (predicting thefuture) that should be more challenging than interpolation. For the queries in ourtraining set, we also create a negative labeled set for learning the weights of ourmodel by pairing a query with 5 documents other than the desired document. Notethat output probabilities of the model will change if we use a number other than 5but the ranking will be the same. We choose 5 negative examples for each queryfollowing Huang et al. [25].In this dataset only 58% of the test set are exact matches. Current approximatestring matching and basic heuristics are based on distance functions. In our appli-cation, 19% of the queries in the test set have an edit distance of more than 10 withtheir final document. So using a method that gives us all documents that have anedit distance less than 10 with the query even with a perfect ranker (a perfect rankerhas accuracy 100% in ranking) to rank the retrieved documents for the given querythe best we can get is 81% while our proposed approach gets to more than 97%.Many of the correct matches are very different than the searched query as strings(e.g., as in the “UAI” vs “uncertainty in AI” examples given) which indicates thatmethods that rely on only distance functions will not work well. Another reasonwhy methods which are based on distance functions do not work well is that they184.2. PreprocessingTable 4.1: Some pairs of terms learned by our heuristic for Tr(t, t′): each two termsin a columns shows a learned pair by our heuristicAssociation Service Centre ConsultingAssoc Svc Center ConsultinAssn Srvc Centr ConsulAssociate Srvs Cntr ConsltngAsso Srv Cntre consltdo not consider the importance of words. As an example consider searching forthe query “UAI association”, where the database contains two records 1- “UAI”,and 2- “NIPS association”. In this case, the second record has a lower edit distancethan the first record, while a document having “UAI” is a better document thana document having “association” for this query. Ignoring the importance of thewords misleads us towards selecting the second record.4.2 PreprocessingWe do the following simple preprocessing steps on the data:• uppercasing letters• deleting annotations and punctuations• singularizing plural termsWe did not stem the terms and did not remove stop-words. We let the model learnthese by itself.4.3 ExperimentsIn order to learn the translation PRV Tr(t, t′), we found on a validation set thatpseudo-counts c1 = 1 and c = 5 give good results. To sparsify the PRV and makeits size manageable, we only keep the pairs of terms whose probability was atleast 0.7, which 0.7 is also selected on a validation-set. This provided us withapproximately 10K pairs of terms. Table 4.1 represents some pairs of terms learnedby our heuristic.We also tried the heuristic proposed in [17] which in our formulation can bewritten as follows:Tr(T,T ′) =Seen(T,T ′)QF(T ′)(4.1)194.3. Experiments 0.910.920.930.940.950.960.970.980.9911 7131925313743495561677379859197hit@kkTFIDFTFIDF+TRTFIDF+TR+BGFigure 4.1: hit@k of TFIDF, TFIDF+TR, and TFIDF+TR+BG.where QF(T ′) corresponds to number of queries in the positive labeled set that hasterm T ′. Accepting only the pairs with at least 0.7 probability, this heuristic pro-vided approximately 500K pairs of terms which severely slowed down the searchengine. Furthermore, we found that the pairs generated using this heuristic are notsuited for our task, and found the main reason to be the use of Seen(T,T ′) in thenumerator. We show this using the following example.Example 13. Let Q= [T1,T2,T3] and the corresponding document D= [T1,T2,T4]be a pair of 〈Q,D〉 in the positive labeled set. According to the heuristic proposedin [17], this 〈Q,D〉 pair supports Tr(T1,T2) and Tr(T1,T4), i.e. T1 being part ofthe translation for T2 and T4. However, since T1 also appears in D, we claimthat it should not support Tr(T1,T2) and Tr(T1,T4) and also it should lower theseprobabilities as if T2 and T4 part of the translation of T1, then T1 would not haveappeared in D.Note that we selected the probability threshold of 0.7 to be consistent with theprobability threshold we have for the Tr probabilities in our model. Furthermore,looking at the learned pairs with this heuristic most of them have a value greaterthan 0.9 so even using a larger threshold will not solve the problem of slowingdown the machine.Consider the effect of adding translations and bigrams to our probabilistic204.4. Automating the process of searching for a queryTFIDF based model. Let TFIDF+TR be the TFIDF model after adding the transla-tions, and TFIDF+TR+BG be the TFIDF+TR model after adding the bigrams. Welearn the weights of our model using the positive and negative labeled sets. Webreak the ties by picking the document that has fewer terms.For each query in our test set, we score the documents using these methods andpick the top k. For a specific value of k, we say a method covers a query if thedesired document appears in the top k results. We define hit@k as the percentageof test queries that the method covers following Bordes et al. [7] and Nguyen [35].Figure 4.1 demonstrates the hit@k of these methods for different values of k. Asit can be viewed from the diagram, both translations and bigrams have a positiveeffect on the performance of the model in terms of hit@k.For some value of k, let C1 =CTFIDF, C2 =CTFIDF+TR and C3 =CTFIDF+TR+BGbe the set of all test queries that are covered using TFIDF, TFIDF-TR, and TFIDF-TR-BG respectively. As k grows, these sets either do not change or grow in size.However, the difference between these sets may differ for different values of k.Figure 4.2 (a) represents the set differences between C1 and C2, Figure 4.2(b)represents the set difference between C1 and C3, and Figure 4.2(c) represents theset difference between C2 and C3. It can be viewed that for all values of k, |C2−C1|is always bigger than |C1−C2| and similarly for C3 indicating that adding transla-tions and bigrams always helps improve the hit@k. Note that as k becomes larger,the set differences |C2−C1| and |C3−C1| become larger and the set differences|C1 −C2| and |C1 −C3| become quite small (5-10 queries out of the 8000 testqueries) and it shows that our proposed methods covers almost everything thatTFIDF covers. That is because for queries with different terms than the actual doc-ument, TFIDF cannot even find the desired document for a large k, whereas theother two methods may be able to do so.4.4 Automating the process of searching for a queryGiven that our model outputs probabilities for different documents being the de-sired document for a query, we can associate it with a utility function. One possibleutility function to be combined with this model is to set a trust threshold tt such thatif the top output of the model has a probability of more than tt, then it is consideredas the desired document without having an expert examine it, i.e. automating theprocess for a percentage of the queries.There are two criteria that are important for picking the threshold tt:• precision: if we trust the top outputs of a model that pass the threshold, whatpercentage of them will be the actual desired documents, which is hit@1measure of those passing the threshold.214.4. Automating the process of searching for a query 0510152025303540451 7131925313743495561677379859197Countk|C2-C1||C1-C2|01020304050607080901 7131925313743495561677379859197Countk|C3-C1||C1-C3|(a) (b) 01020304050601 6111621263136414651566166717681869196Countk|C3-C2||C2-C3|(c) Figure 4.2: Set differences for the hit@k of (a) TFIDF (C1) and TFIDF+TR (C2),(b) TFIDF and TFIDF+TR+BG (C3), and (c) TFIDF+TR and TFIDF+TR+BG.224.5. Thresholding IDF scores 0.950.9550.960.9650.970.9750.980.9850.9975 80 85 90 95precisionttTFIDFTFIDF+TRTFIDF+TR+BGFigure 4.3: hit@1 of the three methods TFIDF, TFIDF+TR, and TFIDF+TR+BGfor different trust thresholds tt.• automation percentage: if we trust the top outputs of a model that pass thethreshold, what percentage of queries will be automatically matched to adocument, without having an expert see them.Figure 4.3 represents the precision of the three models for different trust thresh-olds. It can be seen that the diagrams for the three models overlap and they all havesimilar performances. It can be also seen that all three models have high precisionwhen the threshold is set high enough. Figure 4.4 shows the automation percentagevs. trust threshold. It can be seen that both translations and bigrams increase theautomation percentage for all trust thresholds we tried.4.5 Thresholding IDF scoresIn order to speed-up the process of searching for a query (doing inference to find thedesired document for the query), we tested the TFIDF model with cut-off thresholdvalues for IDF score of terms. The motivation behind considering a threshold forIDF score is that very frequent terms do not contain much information but theyappear in many documents and considering them causes a huge difference on timeof searching (in our proposed method considering more terms means having more234.5. Thresholding IDF scores 0.750.770.790.810.830.850.870.8975 80 85 90 95automation percentagettTFIDFTFIDF+TRTFIDF+TR+BGFigure 4.4: automation percentage given the trust threshold tt for the three methodsTFIDF, TFIDF+TR, and TFIDF+TR+BG.operations to do). So ignoring some of terms with low IDF scores (equivalent toignoring some of terms with high frequency), may not cause much informationloss but can speed up the process of search. Figure 4.5 represents hit@k of TFIDFmodel with different cut-off threshold values for different values of k (for eachquery in our test set, we score the documents using methods explained and pick thetop k.). This figure shows ignoring very frequent words causes information loss,but it does not affect the hit@k much. Figure 4.6 demonstrates average time thattakes for the algorithm explained in 3.1 with a cut-off IDF score threshold to searchfor a record. This figure shows there is huge time difference between consideringall words (cut-off threshold = 0) and ignoring a few words (cut-off threshold = 1)while their hit@k values do not differ much. There is a trade-off between averagetime that takes for the algorithm explained in 3.1 with a specific cut-off thresholdto search for a query and its hit@k.244.5. Thresholding IDF scores 0.90.910.920.930.940.950.960.970.980.9911 7 13 19 25 31 37 43 49 55 61 67 73 79 85 91 97hit@kkThreshold = 0Threshold = 1.0Threshold = 1.5Threshold = 2.0Threshold = 2.5Threshold = 3.0Threshold = 3.5Figure 4.5: hit@k of TFIDF model with different cut-off thresholds of IDF scoreof words for different values of k. 00.050.10.150.20.250 1 1.5 2 2.5 3 3.5average timeIDF thresholdFigure 4.6: Average time to search a query in TFIDF with different IDF thresholds.254.6. Extending the model for documents with more than one field 0.860.880.90.920.940.960.9811 7131925313743495561677379859197hit@kkTFIDF+TRTFIDF+TR+PCFigure 4.7: hit@k of TFIDF+TR and TFIDF+TR+PC.4.6 Extending the model for documents with more thanone fieldWe can extend our proposed model for applications with more information abouteach document, e.g., instead of a document only represented by its name, it rep-resented by two fields of name and address. In this case, We add more WFs toinclude information of all fields.In our application, we have name and postal code of companies. We add thefollowing WF to include information in the postal code field:〈SamePostalCode(d,q),w2〉In which SamePostalCode(d,q) be a Boolean PRV indicating whether documentd has the same postal code as query q or not. We have implemented and testedthis model. Let TFIDF+TR be the TFIDF model after adding the translations, andTFIDF+TR+PC be the TFIDF+TR model after adding the postal codes. Figure 4.7demonstrates the hit@k of these two methods for different values of k. As it canbe viewed from the diagram, considering postal codes has a positive effect on theperformance of the model in terms of hit@k.26Chapter 5ConclusionIn this thesis we are trying to solve a record linkage problem, the problem of findinga new name in a database of records indexed by names (e.g., of companies orrestaurants). There is a close relationship between record linkage and short-textsearch for web, but there are also differences that should be taken into account.We developed a probabilistic model for record linkage using relational logisticregression. We started with a model which relied solely on TFIDF score, thenwe added the possibility of recognizing two terms that are not identical but maybe part of the translation of each other and also addressed the spacing issues byconsidering the bigrams of terms in queries and records. We tested our modelson a large dataset from a telecommunications company and we showed that bothtranslations and bigrams improve the search performance and quality.In future, we intend to consider the translation probabilities learned using ourheuristic as an initialization and then update them by gradient descent using avail-able positive and negative labeled sets. Another possibility for future work is totest whether the translation probabilities learned for one task and dataset can betransferred to a different task with a different dataset.27Bibliography[1] Ramiz M Aliguliyev. A new sentence similarity measure and sentence basedextractive technique for automatic text summarization. Expert Systems withApplications, 36(4):7764–7772, 2009.[2] Paul D Allison. Logistic regression using SAS: Theory and application. SASInstitute, 2012.[3] Joeran Beel, Bela Gipp, Stefan Langer, and Corinna Breitinger. Research-paper recommender systems: a literature survey. International Journal onDigital Libraries, 17(4):305–338, 2016.[4] Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su,Steven Euijong Whang, and Jennifer Widom. Swoosh: a generic approachto entity resolution. The VLDB JournalThe International Journal on VeryLarge Data Bases, 18(1):255–276, 2009.[5] Mikhail Bilenko, S Basil, and Mehran Sahami. Adaptive product normal-ization: Using online learning for record linkage in comparison shopping.In Data Mining, Fifth IEEE International Conference on, pages 8–pp. IEEE,2005.[6] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet alloca-tion. Journal of machine Learning research, 3(Jan):993–1022, 2003.[7] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, andOksana Yakhnenko. Translating embeddings for modeling multi-relationaldata. In Advances in neural information processing systems, pages 2787–2795, 2013.[8] Saskia Le Cessie and J.C. van Houwelingen. Ridge estimators in logisticregression. Applied Statistics, 41(1):191–201, 1992.[9] Tatiana AS Coelho, Pa´vel Pereira Calado, Lamarque Vieira Souza, BerthierRibeiro-Neto, and Richard Muntz. Image retrieval using multiple evidenceranking. IEEE Transactions on Knowledge and Data Engineering, 16(4):408–417, 2004.28Bibliography[10] William Cohen, Pradeep Ravikumar, and Stephen Fienberg. A comparisonof string metrics for matching names and records. In Kdd workshop on datacleaning and object consolidation, volume 3, pages 73–78, 2003.[11] William W Cohen, Henry Kautz, and David McAllester. Hardening soft in-formation sources. In Proceedings of the sixth ACM SIGKDD internationalconference on Knowledge discovery and data mining, pages 255–259. ACM,2000.[12] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer,and Richard Harshman. Indexing by latent semantic analysis. Journal of theAmerican society for information science, 41(6):391, 1990.[13] Pedro Domingos and Daniel Lowd. Markov logic: An interface layer for ar-tificial intelligence. Synthesis Lectures on Artificial Intelligence and MachineLearning, 3(1):1–155, 2009.[14] Susan Dumais, Thomas K Landauer, and Michael L Littman. Automaticcross-linguistic information retrieval using latent semantic indexing. 1997.[15] Bahare Fatemi, Seyed Mehran Kazemi, and David Poole. A learning algo-rithm for relational logistic regression: Preliminary results. arXiv preprintarXiv:1606.08531, 2016.[16] Ivan P Fellegi and Alan B Sunter. A theory for record linkage. Journal of theAmerican Statistical Association, 64(328):1183–1210, 1969.[17] Jianfeng Gao, Xiaodong He, and Jian-Yun Nie. Clickthrough-based transla-tion models for web search: from word models to phrase models. In Proceed-ings of the 19th ACM international conference on Information and knowledgemanagement, pages 1139–1148. ACM, 2010.[18] Jianfeng Gao, Kristina Toutanova, and Wen-tau Yih. Clickthrough-based la-tent semantic models for web search. In Proceedings of the 34th interna-tional ACM SIGIR conference on Research and development in InformationRetrieval, pages 675–684. ACM, 2011.[19] C Lee Giles, Kurt D Bollacker, and Steve Lawrence. Citeseer: An automaticcitation indexing system. In Proceedings of the third ACM conference onDigital libraries, pages 89–98. ACM, 1998.[20] Gene H Golub and Christian Reinsch. Singular value decomposition and leastsquares solutions. Numerische mathematik, 14(5):403–420, 1970.29Bibliography[21] Luis Gravano, Panagiotis G Ipeirotis, Hosagrahar Visvesvaraya Jagadish,Nick Koudas, Shanmugauelayut Muthukrishnan, Divesh Srivastava, et al. Ap-proximate string joins in a database (almost) for free. In VLDB, volume 1,pages 491–500, 2001.[22] Mauricio A Herna´ndez and Salvatore J Stolfo. The merge/purge problem forlarge databases. In ACM Sigmod Record, volume 24, pages 127–138. ACM,1995.[23] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedingsof the 22nd annual international ACM SIGIR conference on Research anddevelopment in information retrieval, pages 50–57. ACM, 1999.[24] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neu-ral network architectures for matching natural language sentences. In Ad-vances in neural information processing systems, pages 2042–2050, 2014.[25] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and LarryHeck. Learning deep structured semantic models for web search using click-through data. In Proceedings of the 22nd ACM international conferenceon Conference on information & knowledge management, pages 2333–2338.ACM, 2013.[26] Matthew A Jaro. Advances in record-linkage methodology as applied tomatching the 1985 census of tampa, florida. Journal of the American Sta-tistical Association, 84(406):414–420, 1989.[27] Matthew A Jaro. Probabilistic linkage of large public health data files. Statis-tics in medicine, 14(5-7):491–498, 1995.[28] Seyed Mehran Kazemi, David Buchman, Kristian Kersting, Sriraam Natara-jan, and David Poole. Relational logistic regression. In KR, 2014.[29] Seyed Mehran Kazemi, Bahare Fatemi, Alexandra Kim, Zilun Peng,Moumita Roy Tora, Xing Zeng, Matthew Dirks, and David Poole. Com-paring aggregators for relational probabilistic models. arXiv preprintarXiv:1707.07785, 2017.[30] Tom Kenter and Maarten de Rijke. Short text similarity with word embed-dings. In Proceedings of the 24th ACM International on Conference on In-formation and Knowledge Management, pages 1411–1420. ACM, 2015.[31] Hang Li, Jun Xu, et al. Semantic matching in search. Foundations andTrends R© in Information Retrieval, 7(5):343–469, 2014.30Bibliography[32] Donald Metzler, Susan Dumais, and Christopher Meek. Similarity measuresfor short segments of text. In European Conference on Information Retrieval,pages 16–27. Springer, 2007.[33] Tom Mitchell. Machine Learning. McGraw Hill, 1997.[34] Alvaro Monge and Charles Elkan. An efficient domain-independent algo-rithm for detecting approximately duplicate database records. 1997.[35] Dat Quoc Nguyen. An overview of embedding models of entities and rela-tionships for knowledge base completion. arXiv preprint arXiv:1703.08098,2017.[36] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, JianshuChen, Xinying Song, and Rabab Ward. Deep sentence embedding using longshort-term memory networks: Analysis and application to information re-trieval. IEEE/ACM Transactions on Audio, Speech and Language Processing(TASLP), 24(4):694–707, 2016.[37] Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart J Russell, and Ilya Sh-pitser. Identity uncertainty and citation matching. In Advances in neuralinformation processing systems, pages 1425–1432, 2003.[38] John C Platt, Kristina Toutanova, and Wen-tau Yih. Translingual documentrepresentations from discriminative projections. In Proceedings of the 2010Conference on Empirical Methods in Natural Language Processing, pages251–261. Association for Computational Linguistics, 2010.[39] Stephen Robertson. Understanding inverse document frequency: on theoret-ical arguments for idf. Journal of documentation, 60(5):503–520, 2004.[40] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance frame-work: Bm25 and beyond. Foundations and Trends R© in Information Re-trieval, 3(4):333–389, 2009.[41] Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. InternationalJournal of Approximate Reasoning, 50(7):969–978, 2009.[42] Gerard Salton and Christopher Buckley. Term-weighting approaches in auto-matic text retrieval. Information processing & management, 24(5):513–523,1988.31Bibliography[43] Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairswith convolutional deep neural networks. In Proceedings of the 38th Interna-tional ACM SIGIR Conference on Research and Development in InformationRetrieval, pages 373–382. ACM, 2015.[44] Manfred J Sippl. Biological sequence analysis. probabilistic models ofproteins and nucleic acids, edited by r. durbin, s. eddy, a. krogh, and g.mitchinson. 1998. cambridge: Cambridge university press. 356 pp.£ 55.00(80.00)(hardcover);£19.95(34.95)(paper). Protein Science, 8(3):695–695,1999.[45] Sheila Tejada, Craig A Knoblock, and Steven Minton. Learning domain-independent string transformation weights for high accuracy object identifi-cation. In Proceedings of the eighth ACM SIGKDD international conferenceon Knowledge discovery and data mining, pages 350–359. ACM, 2002.[46] William E Winkler. Overview of record linkage and current research direc-tions. In Bureau of the Census. Citeseer, 2006.32
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Finding a record in a database
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Finding a record in a database Fatemi, Bahare 2017
pdf
Page Metadata
Item Metadata
Title | Finding a record in a database |
Creator |
Fatemi, Bahare |
Publisher | University of British Columbia |
Date Issued | 2017 |
Description | Consider the following problem: given a database of records indexed by names (e.g., of companies or restaurants) and a new name, determine whether the new name is in the database, and if so, which record it refers to. This problem is called record linkage. Record linkage is a challenging problem because people do not consistently use the official title of a company, but use abbreviations, synonyms, different orders of terms, and the title can contain typos. We provide a probabilistic model using relational logistic regression to find the probability of each record in the database being the desired record for a given query, and find the best record(s). Our model addresses many of challenges of the record linkage problem and provides good results when exact term matching search algorithms fail. We evaluate our model on a large real-world data set. Obtained results show that the model is a promising probabilistic record linkage model. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2017-08-10 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0353194 |
URI | http://hdl.handle.net/2429/62575 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2017-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2017_september_fatemi_bahare.pdf [ 1.69MB ]
- Metadata
- JSON: 24-1.0353194.json
- JSON-LD: 24-1.0353194-ld.json
- RDF/XML (Pretty): 24-1.0353194-rdf.xml
- RDF/JSON: 24-1.0353194-rdf.json
- Turtle: 24-1.0353194-turtle.txt
- N-Triples: 24-1.0353194-rdf-ntriples.txt
- Original Record: 24-1.0353194-source.json
- Full Text
- 24-1.0353194-fulltext.txt
- Citation
- 24-1.0353194.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0353194/manifest