Sampling and Inference inSocial NetworksbyDon Buddhika Wijayantha NettasingheB.Sc. Eng., University of Peradeniya, Sri Lanka, 2013A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF APPLIED SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Electrical and Computer Engineering)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)December 2016c© Don Buddhika Wijayantha Nettasinghe 2016AbstractThis thesis explores three practically important problems related to social networks andproposes solutions utilizing tools from statistical signal processing and applied mathematics.In the first problem, we consider the maximum likelihood estimation of the degree distri-bution of a network from a sampled network. We are interested in the variance of the esti-mates for two widely used degree distribution models of large scale networks: (1) exponentialmodel (2) power-law model. We find that the variance of the estimate is approximately anorder of magnitude smaller for exponential model when compared to the power-law model.The intuition behind this interesting property is conjectured to follow from the second orderstochastic dominance of the probability distributions.In the second problem, we consider a social network where each node has a vector ofparameters (interests) associated with it. Only the partial graph and parameters of a sub-set of nodes are known. Our aim is to identify the parameters associated with each node.Assuming the Homophily behaviour of the social network, we propose a Cayley-Menger Deter-minant based optimization approach combined with a previously proposed spectral clusteringmethod to solve the interest identification problem.In the third problem, we study interest influencing in social networks where the aim is tofind the minimum set of missing edges in order to maximize the influential power over theinterests of the individuals in the network. This problem is formulated as a classic problemin structural rigidity theory using the characteristic Contagion behaviour of social networks.We propose a solution approach using the Laman’s theorem which characterizes minimallyrigid graphs in the two dimensional space.iiPrefaceThe following publications have resulted from the research presented in this thesis:• D. B. W. Nettasinghe, V. Krishnamurthy, and V. K. Bhargava, “Identification of In-dividual Interests in Online Social Networks Using Cayley-Menger Determinants,” ac-cepted for presentation at 2017 IEEE International Conference on Acoustics, Speechand Signal Processing (ICASSP). (Linked to Chapters 3)Statement of AuthorshipI am the primary author for the publications listed above. I have been responsible fordeveloping original ideas, deriving mathematical solutions, and generating simulation resultsfor these publications with valuable frequent suggestions, advice and feedback from Prof.Vikram Krishnamurthy and Prof. Vijay K. Bhargava.A version of the content presented in the Appendix A has been submitted as a courseproject for STAT 547C: Topics in Probability course.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1.1 Effect of the Underlying Model on Estimating the Degree Distributionof a Network Under Sampling . . . . . . . . . . . . . . . . . . . . . . 21.1.2 Interest Identification in Online Social Networks . . . . . . . . . . . 41.1.3 Interest Influencing in Online Social Network . . . . . . . . . . . . . 51.2 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.2.1 Estimating the Degree Distributions of Networks Under Noise . . . . 61.2.2 Identification of Individual Interests in Online Social Networks . . . 71.2.3 Interest Influencing in Online Social Networks . . . . . . . . . . . . . 7ivTable of Contents1.3 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Estimating the Degree Distributions of Networks Under Noise . . . . . 102.1 Why do we Sample Networks? . . . . . . . . . . . . . . . . . . . . . . . . . 102.2 Estimating Degree Distributions from Sampled Networks: A Model BasedComparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.2.1 Problem Definition and Main Result . . . . . . . . . . . . . . . . . . 122.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.3.1 Maximum Likelihood Estimation: A brief Introduction . . . . . . . . 132.3.2 Deriving Maximum Likelihood Estimators . . . . . . . . . . . . . . . 152.3.3 Crame`r-Rao Bound Comparison . . . . . . . . . . . . . . . . . . . . 162.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.5 Intuitive Explanation: Second Order Stochastic Dominance . . . . . . . . . 202.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 Identification of Individual Interests in Online Social Networks . . . . . 243.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.2 Problem Definition and Assumptions . . . . . . . . . . . . . . . . . . . . . . 253.3 Interest Identification Using Cayley-Menger Determinants . . . . . . . . . . 263.4 Formulation of the Optimization Problem: A Numerical Example . . . . . . 323.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334 Interest Influencing in Online Social Networks . . . . . . . . . . . . . . . . 354.1 Interest Influencing: Problem Definition and Assumptions . . . . . . . . . . 354.2 Interest Influencing Using Rigidity Theory . . . . . . . . . . . . . . . . . . . 374.3 Assessing Rigidity in 2-Dimensional Space: Numerical Examples . . . . . . 384.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40vTable of Contents5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 41Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43AppendixA Methods of Sampling Networks: A Brief Survey . . . . . . . . . . . . . . 50A.1 Sampling by Random Node Selection . . . . . . . . . . . . . . . . . . . . . . 50A.1.1 Random Node Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 50A.1.2 Breadth-First Sampling (BFS) . . . . . . . . . . . . . . . . . . . . . 51A.2 Sampling by Random Edge Selection . . . . . . . . . . . . . . . . . . . . . . 52A.3 Sampling by Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52A.3.1 Random Walk Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 52A.3.2 Frontier Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56viList of Figures2.1 Variation of the CRLBs of the estimators, λˆ and γˆ, with the true parametersλ and γ, in their respective regions of interest . . . . . . . . . . . . . . . . . 182.2 Variation of the MSEs of the estimators, λˆ and γˆ, with the true parameters,λ and γ, in their respective regions of interest . . . . . . . . . . . . . . . . . 192.3 The distributions considered to depict the SOSD . . . . . . . . . . . . . . . . 212.4 The integral of the CDFs from negative infinity showing the SOSD . . . . . 223.1 Illustration of the locations of the reference nodes (black dots) and triangula-tion process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.1 (a) Flexible graph (b) A minimally rigid graph (c) Redundantly rigid graphin two dimensional space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39viiList of AbbreviationsAIC : Akaike Information CriterionBFS : Breadth-First SamplingCDF : Cumulative Distribution FunctionCRLB : Crame`r-Rao Lower BoundID : Identification NumberMISC : Missing node Identification by Spectral ClusteringML : Maximum LikelihoodMSE : Mean Squared ErrorRDN : Random Degree NodeRW : Random WalkSOSD : Second Order Stochastic DominanceviiiAcknowledgementsFirst and foremost, I would like to thank my advisors, Professor Vikram Krishnamurthyand Professor Vijay K. Bhargava, for their support, guidance and and generous financialsupport. I am grateful to Prof. Vikram Krishnamurthy for always making time to have longdiscussions related to various aspects of statistical signal processing that ignited my passiontowards this research area and for guiding me to explore the various tools and techniquesstudied in this thesis. I am grateful to Professor Vijay K. Bhargava for the academic guidancehe provided from the beginning of my MASc and for allowing and facilitating me to pursuethe areas that I have always been passionate and curious about, and thus making my MAScan amazing learning experience. This thesis would not have been possible without theirsupport, advice, and encouragement.Secondly, I would like to extend my sincere thanks to all the excellent colleagues at theInformation Theory and Systems laboratory, who offered me genuine and friendly supportto carry out my research. I should particularly thank K.N.R. Surya Vara Prasad for hisfriendship and advice in making many academic decisions during my MASc. I would alsolike to thank Duong Nguyen, Mark Adams, Sudha Lohani, Shankhanaad Mallick, PeiranWu, Jun Zhu, Rudi Plesch, Anup Aprem, Shahrear Tanzil, Nandinee Fariah, Sujay Bhattand Tavis Pederson for their friendship and support during my MASc research.Lastly, I am profoundly indebted to my parents, brother and friends in Sri Lanka for theirunconditional love and blessings. Without their continuous support and encouragement, Icould not have completed this thesis.ixDedicationTo my father Don Chandrapala Padmasiri Nettasinghe, mother Gunasiri Rajapaksa Manike,and brother Buddhima Nettasinghe.With gratitude for your love and support.xChapter 1Introduction1.1 OverviewSocial networks have become a crucial part of the modern society: people use social networksto connect with family and friends, to share news and information, to get updates on events,to obtain the views of others regarding various issues, etc. This integration of social networksinto the day to day lives of people has made drastic changes in the way people make decisionsin voting, purchase goods, adapt new beliefs or views, etc.These drastic effects of the social networks on almost every aspect of modern society hasgiven rise to a growing interest in the research community to model and analyze the socialnetworks in order to understand their dynamics in an abstract manner. Further, since asocial network is a highly complex and dynamic network that continuously evolves, toolsand techniques from a wide variety of research areas such as electrical engineering, computerscience, microeconomic theory and statistics can be employed to model various aspects ofsocial networks. For example, probability theory and statistical inference is used to estimatevarious characteristics of a social network such as degree distributions [24], tools arisingfrom microeconomics such as game theory and revealed preference are used to study howindividuals with limited awareness of the network can achieve global coordination and tomodel various human behaviours [19, 34] and concepts from machine learning theory areused to learn and update the operating policies of individuals in social networks.In this thesis, we explore three problems related to social networks that are of highpractical importance, namely, model based estimation of the degree distribution of a social11.1. Overviewnetwork, identification of individual interests in a partially known network and interestinfluencing in a social network. In the rest of this section, we provide a brief introductionabout each these problems and explain the motivation to explore that problem. Each ofthese problems will be discussed in detail in the subsequent chapters.1.1.1 Effect of the Underlying Model on Estimating the DegreeDistribution of a Network Under SamplingOne of the most important characteristics of a network is the degree distribution Dk whichgives the probability of a node chosen uniformly at random having k number of neighbours.Key properties of a network such as connectivity of the graph, information diffusion, exis-tence of a giant component can be analyzed using the degree distribution of a network [34].Therefore, the estimation of the degree distribution is of very high importance in analyzinga social network.Two of the most widely used models for the degree distributions of large scale networksare exponential model (with degree distribution Cexpe−λk characterized by the parameterλ ) and power-law model (with degree distribution Cplk−γ characterized by the parameterγ). The exponential model is used for real world networks such as Maritime TransportationNetwork, Email Networks and Power Grid Networks [12] with parameter λ ∈ (0, 1). Further,such a degree distribution also holds for an Erdo˝s-Re´nyi random graph [52]. The power-lawmodel is used for scale-free networks such as social networks, World Wide Web [12] withparameter γ ∈ (2, 3) [8, 10].However, since most of the real world networks currently consists of millions of nodes, itis not feasible to analyze the complete network in order to identify the parameters of the un-derlying degree distribution model (this will be further explained in Section 2.1). Therefore,the most commonly used approach to estimate the degree distribution of a social networkis to utilize a survey based sampling approach instead of analyzing the whole network. A21.1. Overviewsurvey based sampling method includes a sampling process to select a subset of nodes and aquerying method to obtain the parameter of interest (the true degree of the node in this case)from each sampled node. These two steps can be thought of as the step of selecting peoplefor an experiment (sampling step) and asking questions from the selected people (queryingstep).However, this survey based sampling introduces two types of noises (or random errors)into the process of estimating the degree distribution, namely, sampling noise and mea-surement (querying) noise. Sampling noise represents the randomness introduced from thesampling process (the process with which we select the participants of the survey). Measure-ment noise is introduced from the querying process. For example, consider the scenario ofusing a Facebook survey to ask every sampled node about his/her number of friends. Mea-surement noise accounts for the likely situation where a person does not actually providehis/her correct answer in such a survey (most work in literature assume that there is nomeasurement noise i.e. all queries of sampled nodes yield the true degree of that node in theoriginal network). Due to these two types of noises, the accurate estimation of the degreedistribution of the underlying network becomes a non-trivial problem.Motivated by the above facts, in Chapter 2, we consider the problem of maximum likeli-hood estimation of a degree distribution from the degree distribution of a sampled networkfor two distinct cases: 1) The underlying graph is known to have an exponential distribution2) The underlying graph is known to have a power-law distribution. The primary aim of thischapter is to answer the question, “Does the underlying degree distribution of a net-work have an effect on the accuracy of the maximum likelihood estimation?”.Our main result in this section shows that the maximum likelihood estimation has a betterperformance in exponential graphs when compared to the power-law graphs. Specifically,we show that the variance of the maximum likelihood estimate is approximately an order ofmagnitude smaller for exponential graphs compared to the power-law graphs. We furtherconjecture that the reason for this behaviour is the second order stochastic dominance of the31.1. Overviewprobability distributions.1.1.2 Interest Identification in Online Social NetworksIn a social network with graph G = (V,E), each individual has a set of parameters that areunique to him/her which are represented by the vector h(vi) for each node vi ∈ V . Theseparameters could be related to interests or beliefs, shopping preferences, etc 1. The knowledgeabout such parameters of the people in a social network may be of very high importance andvalue for purposes such as marketing, security, etc. However, the parameters of all nodes inan online social networks may not be visible to everyone. There may exist some nodes whosesocial structure (how they connect to other nodes in the social network) and the associatedparameters are not known. The set of such nodes, Vm ⊂ V , are called missing nodes. Forexample, consider a company called ABC that has a set of customers subscribed in Facebook(a similar example is considered in [14, 15] for a different problem). The profiles of the setof people Vk ⊂ V who have subscribed to the services of the company are visible to thecompany. Hence, the parameters h(vi) of each customer vi ∈ Vk are known to the company.However, ABC is interested in knowing the parameters associated with the set of peopleVm who are not its subscribers since ABC can use the knowledge of such parameters toidentify the best set of potential customers who would be interested in the services of ABC.For example, if ABC is a company that sells sports cars, the parameter which indicates if aperson (who is not a subscriber of ABC) is interested in sports cars would be of interest toABC in order to identify potential customers. This motivates us to pursue the aim of thechapter which is formally stated below.We consider a social network represented by the undirected graph G = (V,E) in whichthere is an interest vector h(vi) ∈ Rd associated with each node vi ∈ V . We assume that1In Facebook, these parameters include date of birth, gender, religion, political views, current and pastoccupations, attended university/high school, current and previous places of residence, spoken languages,family and relationship information, shopping preferences, music and movie preferences, travelled places,attended events, hobbies.41.1. Overviewthe set of nodes Vm ⊆ V are missing, that is, we know neither how they connect to therest of the nodes nor their interests. We only know part of the network that is denoted asGk = (Vk, Ek). Therefore, V = Vk ∪ Vm and E = Ek ∪ Em. In this context, the aim of theChapter 3 is Interest Identification: Given Gk, the cardinality of Vm, i.e. |Vm|, and theinterests associated with each known node vi ∈ Vk, estimate the interests associated witheach missing node vj ∈ Vm.As the main results of the Chapter 3, we introduce a method to solve the problem ofinterest identification of missing nodes utilizing a Cayley-Menger Determinant based opti-mization approach.1.1.3 Interest Influencing in Online Social NetworkThe interests of the people in a social network may tend to vary over the time. This could bebeneficial or not beneficial to certain entities. For example, consider our previous example ofABC company. ABC company wants to make sure that it does not lose its existing customerbase. Hence, ABC company should make sure that the interests of its customers always staypositive towards the company. If strong ties could be built between customers who tend toleave the company and company’s most loyal customers, there could be a possibility thatthe customers who are about to leave would be motivated to stay further with the company.For example, the ABC company can introduce a customer that is about to leave it to fewof its loyal customers. If these new social ties are strong, there is a high possibility that thesaid customer will decide to continue to stay with the company.In the previous example, we build connections between people in order to stop an un-desirable change of interests. This is based on the concept of social contagion behaviour ofsocial networks. Contagion/influence causes a strong friendship between two persons to passattributes of one person to another person (induction) [9, 46]. Results given in [21] specif-ically show that shopping habits and leisure activities are contagious in social networks.Recently, [36] concluded that the individual behaviour of people in a social network depends51.2. Main Contributionsmore on their local network structure (how they are connected to the rest of the nodes),than the actual behaviour of the whole network, a phenomenon which is called “the major-ity illusion”. A very interesting article, [38], which recently appeared in Harvard BusinessReview discusses this further and says that “you only need handful of influencers to give theimpression that everyone is talking about your brand” due to this majority illusion paradox.Motivated by the above examples and facts, in Chapter 4, we explore this idea from amathematical point of view to find out how to create a network structure in order to gain amaximum influential power over the interests of people in the network. The primary questionanswered here is “How can we introduce new connections between individuals of anetwork in a manner such that the new connections will prevent the preferencesof individuals from changing?”. As our solution approach to this problem, we introducea method to identify the minimal set of new connections that should be added to the networkin order to influence the interests of the nodes of a social network. We achieve this byviewing the interests associated with the nodes of the social network as a bar and jointframework residing in the 2-dimensional Euclidean space and then utilizing concepts fromthe mathematical branch of rigidity theory.1.2 Main ContributionsIn the following subsections, a brief list of contributions of each chapter of this thesis is pro-vided. Detailed discussions of each contribution are provided in the corresponding chapters.1.2.1 Estimating the Degree Distributions of Networks UnderNoiseAs explained in Section 1.1.1, Chapter 2 compares the maximum likelihood estimates ofdegree distributions for exponential and power-law models. The main contributions of thischapter are:61.2. Main Contributions1. A complete formulation of the maximum likelihood estimation problem of the truedegree distribution from the sampled (noisy) distribution for both exponential andpower-law networks.2. Showing that the variance of the maximum likelihood estimate is approximately anorder of magnitude smaller for the exponential model compared to the power-law model(in the respective regions of interests of the parameters) from both numerical as wellas theoretical analysis.3. An intuitive explanation of the main result(contribution 2) using the second orderstochastic dominance concept from microeconomic theory.Apart from the above main technical contributions, a brief survey about sampling meth-ods for social networks is also provided in Appendix A for the completion of the Chapter2.1.2.2 Identification of Individual Interests in Online SocialNetworksBased on the motivation provided in Section 1.1.2, Chapter 3 explores the problem of interestidentification in online social networks. The main contributions of this chapter are:1. A complete formulation of the problem of interest identification assuming that thesocial network exhibits the Homophily Behaviour (friends have similar interests)2. An optimization approach to solve the problem of interest identification subject toconstraints obtained using Cayley-Menger Determinant based approach.1.2.3 Interest Influencing in Online Social NetworksBased on the motivation and the examples provided in Section 1.1.3, Chapter 4 considersthe problem of interest influencing. The main contributions of the Chapter 4 are:71.3. Outline of the Thesis1. A complete mathematical formulation of the problem of interest influencing.2. A method to solve the problem of interest influencing by using the tools from themathematical branch of Rigidity Theory.1.3 Outline of the ThesisThe outline of the rest of this thesis is as follows.Chapter 2 considers the maximum likelihood estimation of degree distribution of networksfor two models: exponential model and power-law model. It begins by emphasizing theimportance of sampling in analyzing a social network and explains the effect of sampling onestimating a parameter of a social network. It then defines the aim of the chapter whichis to find out if the accuracy of maximum likelihood estimates of the degree distributiondepends on the underlying model (exponential or power-law). Then, we derive the maximumlikelihood estimates of the degree distributions for the two models and theoretically comparetheir variances using Crame`r-Rao Lower Bounds to prove the main result (the variance of theestimate for the exponential model is approximately an order of magnitude smaller comparedto the power-law model). Then, a comparison of the mean squared errors of the estimatorsof the two models is provided in order to numerically verify the main result. Finally, aconjecture utilizing the Second Order Stochastic Domination of probability distributions isprovided to intuitively explain the main result of the section.Chapter 3 considers the problem of interest identification in social networks. We startby providing a brief explanation about the related work and then define the problem ofinterest identification. We also provide a detailed explanation of the main assumption ofour work, Homophily behaviour, and how we mathematically model it. Then, we explain oursolution to the interest identification problem which is a Cayley-Menger determinant basedoptimization approach and also provide a numerical example to illustrate the importantsteps of the proposed method.81.3. Outline of the ThesisChapter 4 considers the problem of interest influencing and begins by formulating theproblem based on the contagion behaviour of individuals in a social network. Then, theproposed method of solving the problem of interest influencing using the concept of rigiditytheory is explained. Finally, some numerical examples are provided to demonstrate mainsteps of the proposed work.Finally, Chapter 5 provides a conclusion of the work presented in this thesis and somefuture research directions.9Chapter 2Estimating the Degree Distributionsof Networks Under Noise2.1 Why do we Sample Networks?As described in Chapter 1, online social networks provide an interesting platform for studiesin various domains. However, studying such social networks is difficult due to two primaryreasons. The first reason is the massive size of these social networks. For example, in2016, Facebook reports 1.71 billion monthly active users [1] and Twitter reports 310 millionmonthly active users [2]. Due to these massive sizes of the online social networks, it isimpossible to perform any analysis on the actual network. The second difficulty arises inacquiring the data of each and every person in these massive social networks since most ofthe network administrators are not willing to provide their data for free [20].A practical solution to the above difficulties in social networks analysis is to select asmaller (compared to the original network) subnetwork from the original social networkwhich best represents the original network. This approach is called network sampling. Themain decisions [37] that we have to make in the network sampling can be listed as:• What is the sampling method?• What is the sample size?• How good is the selected sample?In general, sampling methods for networks can be classified in to three main categories:102.1. Why do we Sample Networks?1. Sampling by Random Node Selection2. Sampling by Random Edge Selection3. Sampling by Exploration.A brief discussion about each of the above sampling methods and their pros and cons isprovided in the Appendix A.It should be noted that a parameter of a sampled network will be a noisy estimate ofthe same parameter of the original network since the sampling process may not be able tocapture the original parameter accurately (as emphasized in Section 1.1.1 as well). Thisparameter in concern for this chapter is the degree degree distribution of the network andwe focus on estimating it under sampling noise in the following sections.The organization of the rest of the chapter is as follows. Section 2.2.1 first definesthe problem and then the Section 2.2.2 provides a brief survey of the key work that hasbeen done in this context and explains how our work differs from them. In Section 2.3,we state our assumptions and formulate the maximum likelihood estimation problem forthe two models and theoretically compare the variances of the estimates. In Section 2.4,the numerical results are provided for Mean Squared Error (MSE) to verify the theoreticalanalysis. Further, we provide an intuitive explanation of the main result using the conceptof second order stochastic dominance and risk aversion in Section 2.5. Finally, in Section2.6, we provide a summary of this chapter and some interesting future research directions ofthe presented work.112.2. Estimating Degree Distributions from Sampled Networks: A Model Based Comparison2.2 Estimating Degree Distributions from SampledNetworks: A Model Based Comparison2.2.1 Problem Definition and Main ResultConsider a network represented by the undirected graph G = (V,E). Our aim is to comparethe variance of the estimates of the degree distribution Dk of G, using the degree distributionD′k of the sampled network G′ = (V ′, E ′) for two distinct cases:1. The underlying graph G is known to have an exponential degree distribution with aparameter λ2. The underlying graph G is known to have a power-law degree distribution with a powerlaw exponent γ.The main result is that the variance is approximately an order of magnitude smallerfor the exponential model compared to the power law model in the respective regions ofinterests of the parameters from both numerical as well as theoretical analysis. This resulthas not been presented in the existing literature to the best of our knowledge and thisresult emphasizes that more accurate techniques should be employed when estimating thepower-law degree distributions compared to exponential degree distributions.2.2.2 Related WorkRecently, [54] has analyzed the problem of estimating the degree distribution of networksusing a sampled network. Their work is based on the findings in [16, 17] which says that,for some sampling methods (with no observation noise), the sampled distribution (D′) andthe true distribution (D) are related according to,E[D′] = PD (2.1)122.3. Problem Formulationwhere P is a matrix that depends fully on the sampling scheme. They claim that theoperator P is not invertible in practice (ill-conditioned) and hence use a weighted leastsquares based approach to solve the problem of estimating the true degree distribution, D.The main difference between this work and the method that we present lies in the fact thatour approach is model-based, whereas the the work in [54] is design-based. In other words, weconsider our observation to be a realization from a stochastic process that depends on a fixedparameter, whereas the randomness in design-based methods is assumed to be only due tosampling approach [25]. Further, [25] specifically concludes that design based methods sufferfrom severe limitations in network inference and hence model-based methods are preferred.[4] considers the problem of estimating the parameter of a power law distribution (notnecessarily related to networks) using maximum likelihood based methods. However, in thiswork, the likelihood of the data given the model is considered with the assumption of nonoise, whereas the noise is taken into account in our model. Further, the main goal of [4] isto compare the maximum likelihood based method with other methods such as least squarebased methods and graphical methods, whereas our intention is to compare the performanceof maximum likelihood estimation for two models. Similarly, [10] also utilizes maximumlikelihood based approach to analyze data following power law distributions. Similar to [4],authors of [10] also mention that maximum likelihood based approaches are better whencompared to other methods and attempt to detect and characterize power law distributions.2.3 Problem Formulation2.3.1 Maximum Likelihood Estimation: A brief IntroductionConsider a system characterized by a deterministic parameter, θ∗ and a set of measurementsYN = (y1, . . . , yN). Intuitively, the aim of Maximum Likelihood (ML) Estimation is tocompute the most plausible value of the parameter which could have produced YN . This132.3. Problem Formulationplausibility is measured using the likelihood function, L(θ,N), which is defined as,L(θ,N) = P(YN |θ), θ ∈ Θ (2.2)where, Θ is the parameter space and the θ characterizes the parameterized model. The MLestimate θˆN of θ∗ is then defined as,θˆN = argmaxθ∈ΘL(θ,N). (2.3)Often, it is more convenient to maximize log-likelihood,L(N, θ) = lnL(θ,N) (2.4)since maximizers of both functions are the same.The reasons for using ML estimation in our work are the following two properties [6, 26].1. Strong Consistency: Under reasonable conditions, the ML estimate, θˆN , convergesto the true parameter, θ∗, almost surely i.e.P(limN→∞θˆN → θ∗)= 1 (2.5)2. Asymptotic Normality: The ML Estimate, θˆN , is asymptotically normally dis-tributed about the true parameter, θ, with a variance equal to the inverse of the FisherInformation, Iθ∗ [29]A more detailed discussion about using ML estimation and its applications in varioussystem models (linear Gaussian state space, hidden Markov model, etc) can be found in [33].142.3. Problem Formulation2.3.2 Deriving Maximum Likelihood EstimatorsThe degree distribution of an exponential graph with a parameter λ is of the form,Dexp,k = (1− e−λ)eλkmine−λk (2.6)where kmin is the minimum degree of the graph.Similarly, the degree distribution of a power-law graph (in the range from kmin to kmax)with parameter γ is given by,Dpl,k =k−γζ(γ, kmin)− ζ(γ, kmax) , (2.7)where ζ(γ, k) is the Hurwitz zeta function,ζ(γ, k) =∞∑n=0(n+ k)−γ. (2.8)We assume that the noise introduced from the measurement process and/or the samplingprocess into the measured distribution, D′, to be additive Gaussian noise. [54] shows thatsampling noise can be approximated using a Gaussian model. If the sample size is consider-ably large, the measurement noise can also be approximated by a Gaussian random variableby the central limit theorem, irrespective of the actual underlying distribution of the noiseassociated with each query.Hence, it is fair to assume that the combined effect of both noises can be modelled by anormal random variable. Therefore, the measured distributions can be expressed as,D′exp,k = Dexp,k + ηexp,k (2.9)D′pl,k = Dpl,k + ηpl,k (2.10)152.3. Problem Formulationwhere ηexp,k and ηpl,k are both independent (in k) normal random variables with zero meanand variance σ2. For simplicity, we assume that kmin = 1 and kmax = N .Then, using (2.6) and (2.9), the log-likelihood function for the exponential parameter,Lexp(λ) = lnN∏k=1P(D′exp,k|λ), can be expressed as,Lexp(λ) = −N2ln(2piσ2)− 12σ2N∑k=1(D′exp,k − (1− e−λ)e−λ(k−kmin))2. (2.11)Similarly, using (2.7) and (2.10), the log-likelihood function for the power-law parameter,Lpl(γ) = lnN∏k=1P(D′pl,k|γ), can be expressed as,Lpl(γ) = −N2ln(2piσ2)− 12σ2N∑k=1(D′pl,k −k−γζ(γ, kmin)− ζ(γ, kmax))2. (2.12)The maximum likelihood estimators, λˆ and γˆ, for the parameters λ and γ can then beobtained as,λˆ = argmaxλ[Lexp(λ)] (2.13)γˆ = argmaxγ[Lexp(γ)]. (2.14)2.3.3 Crame`r-Rao Bound ComparisonIn order to theoretically compare the ML estimates for the two models, we analyze theCrame`r-Rao Lower Bounds (CRLBs). We consider a degree distribution with N = 30 foreach model. Further, we assume that the noise variance, σ2 = 0.05 for both exponential andpower-law models.CRLB is defined as the reciprocal of the Fisher Information [29]. The significance ofthe CRLB is that the variance of any unbiased estimator is bounded below by the CRLB.Based on the asymptotic normality of the maximum likelihood estimators, we can assume162.4. Numerical Resultsthat our estimators are unbiased (due to the large number of measurements) and thus,permitting the use of CRLBs for the comparison of the estimators. Further, ML estimatorsare asymptotically efficient i.e. when the number of measurements tends to infinity thevariance of the ML estimator will reach the CRLB [33].The CRLBs, λˆCRLB and γˆCRLB, of the two estimators, λˆ and γˆ, yields,var(λˆ) ≥ λˆCRLB = −[E[∂2Lexp(λ)∂2λ∣∣∣∣λ=λ∗]]−1(2.15)var(γˆ) ≥ γˆCRLB = −[E[∂2Lpl(γ)∂2γ∣∣∣∣γ=γ∗]]−1(2.16)where λ∗ and γ∗ are the true parameters. The second order partial derivatives inside theexpectation operator in (2.15) and (2.16) can be calculated easily and then the expectedvalues can be obtained using a Monte Carlo simulation over a large number of independentmeasurements. The resulting plots of the CRLBs for the two estimators are shown in Fig.2.1. It can be clearly seen that the CRLB of the estimator for the exponential model issmaller compared to the CRLB of the estimator for the power-law model.2.4 Numerical ResultsIn order to verify the theoretical comparison in Section 2.3.3, we compare the MSEs of theestimators for the two same models (N = 30 and σ2 = 0.05)The maximum likelihood estimation of the parameters was performed by numericallyevaluating (2.13) and (2.14). Then, the MSEs, λˆMSE and γˆMSE, of the estimators λˆ andγˆ, were obtained by a Monte Carlo method for different values of the true parameters, λand γ, within their respective regions of interests. Fig. 2.2 shows the resulting plots in thesame figure. It can be clearly seen from Fig. 2.2 that the MSE associated with the λˆ issignificantly smaller compared to the MSE associated with the γˆ in the respective regionsof interest. Hence, the MSEs associated with the two models show that the maximum172.4. Numerical Results0 0.5 1 1.5 2 2.5 3Parameters: λ and γ00.020.040.060.080.10.12Crame`r-RaoBoundsλˆCRLBγˆCRLBFigure 2.1: Variation of the CRLBs of the estimators, λˆ and γˆ, with the true parameters λand γ, in their respective regions of interest182.4. Numerical Results0 0.5 1 1.5 2 2.5 3Parameters: λ and γ00.010.020.030.040.050.06MeanSquaredErrorλˆMSEγˆMSEFigure 2.2: Variation of the MSEs of the estimators, λˆ and γˆ, with the true parameters, λand γ, in their respective regions of interestlikelihood estimation has a significantly better accuracy for exponential model compared tothe power-law model.By comparing Fig. 2.2 and Fig. 2.1, it can be seen that this theoretical analysis agreeswith the MSE comparison and thus, confirming the fact the maximum likelihood estima-tion has a smaller variance (approximately an order of magnitude smaller in the middle ofthe regions of interests) in the case of exponential distribution compared to the power-lawdistribution.192.5. Intuitive Explanation: Second Order Stochastic Dominance2.5 Intuitive Explanation: Second Order StochasticDominanceIn Section 2.3.3 (and numerically in Section 3.4), we showed that the maximum likelihoodestimator has a higher variance for power-law distributions compared to the exponentialdistributions with both numerical and theoretical analysis. However, the intuition behindthis interesting result is still not completely clear at this point. Hence, we attempt to providean intuitive explanation (as a conjecture) of this result using the concept of Second OrderStochastic Dominance (SOSD) [42].Consider the two Cumulative Distribution Functions (CDFs), F and G. F second-orderstochastically dominate G, denoted F SOSD G, if∫ x−∞F (t) dt ≤∫ x−∞G(t) dt, ∀x ∈ R. (2.17)This concept of SOSD is used in microeconomic theory for risk averse decision making.As a simple example, consider a person who has to make a choice between betting moneyon two lotteries with the same expected winnings and characterized by the two distributionsF and G where F SOSD G. If the person is risk averse (a person who tries to avoid risk)he should choose F . Further, the variances of the two random variables also are related tothe SOSD in the following way. If the two distributions F and G are equal in mean, thenF SOSD G implies that F has a smaller variance compared to G (the converse is not true).Fig. 2.3 shows an exponential distribution function (λ = 0.657) and a power-law distributionfunction (γ = 2.2) with equal expected values (2.0772), their CDFs and the integrals of theCDFs (the integrals in (2.17) are replaced by summations for this discrete case). It can beseen from Fig. 2.3(c) that the exponential distribution second order stochastically dominatethe power-law distribution. Hence, we conjecture that SOSD can be the reason behind thesignificant difference of the variances of the two estimators. In other words, this means that202.5. Intuitive Explanation: Second Order Stochastic Dominance0 5 10 15 20 25 30Degree00.10.20.30.40.50.60.7ProbabilityDistributionsExponential DistributionPower-Law Distribution(a) The exponential distribution (λ = 0.657) and power-law distribution (γ = 2.2) with equal means0 5 10 15 20 25 30Degree0.40.50.60.70.80.91CumulativeDistributionsExponential DistributionPower-Law Distribution(b) The CDFs of the two distributionsFigure 2.3: The distributions considered to depict the SOSD212.6. Summary0 5 10 15 20 25 30Degree051015202530IntegralofCDFs(from−∞)Exponential DistributionPower-Law DistributionFigure 2.4: The integral of the CDFs from negative infinity showing the SOSDit is less accurate to estimate parameters associated with a mean preserving spread of adistribution compared to the original distribution. This agrees with the intuition becauseit is always difficult to estimate a mean preserving spread, because it is a more disperseddistribution compared to the original distribution.2.6 SummaryThis chapter presented a comparison between the maximum likelihood estimation of net-work degree distributions from degree distributions of sampled networks for two cases: (1)the underlying network has an exponential degree distribution (2) the underlying networkhas a power-law degree distribution. It was assumed that the noise introduced from sam-pling/measurement process follows a zero mean Gaussian distribution (as a result of thecentral limit theorem) and that we know the type of the underlying distribution. Maximumlikelihood estimators were derived for both models and then their variances were analyzed222.6. Summaryusing both mean squared errors and the Crame`r-Rao Bounds. The results show that themaximum likelihood estimator of the exponential model has a significantly smaller variancecompared to the maximum likelihood estimator of the power-law model. Hence, maximumlikelihood estimation has a better accuracy when the underlying network has an exponen-tial degree distribution compared to a power-law degree distribution. The reason for thisinteresting behaviour was then conjectured to be the Second Order Stochastic Dominanceof probability distributions. This was illustrated by considering a situation where the twodistributions have the same mean.23Chapter 3Identification of Individual Interestsin Online Social Networks3.1 Related WorkThe first step of interest identification is to identify the missing nodes and how they connectto the rest of the nodes. For this step, we use the missing node identification by spectralclustering (MISC) algorithm proposed in [14, 15] which returns the prediction of the fullnetwork graph Gˆ = (Vˆ , Eˆ) using only the number of missing nodes and an indication forthe existence of missing nodes and edges. For example, this indication could be a Facebookpost by an existing customer of the ABC Sports Car company mentioning about a friendwho also might be interested in sports cars, but whose profile is not visible to the company.Similarly, these indications could also be Facebook posts, photos, tweets, etc. dependingon the application. Image and text recognition software can be utilized to identify suchindications in a practical implementation scenario. Readers are highly encouraged to read[14, 15] for a full description about the MISC algorithm and a comparison with the othernetwork inference methods. Various other algorithms that are based on observations of adiffusion process are also presented in [11, 32, 44]. However, we use the MISC algorithm asthe first step of our proposed theoretical framework since it fits our assumptions and modelperfectly and then build the interest identification method from there.Cayley-Menger Determinants are used to express the volumes of simplexes in Euclideanspaces in terms of the distances between the nodes [5, 48, 49]. This concept has been used243.2. Problem Definition and Assumptionspreviously for applications in sensor networks, localization and robotics such as [30, 31, 51].In our work, we use Cayley-Menger Determinants to obtain constraints relating the distancesbetween the known and unknown interest vectors in order to formulate an optimizationproblem similar to the method proposed in [7].The problem of the interest identification in a social network from a partial graph andpartial set of interests has not been addressed before in the published literature to the bestof our knowledge. Further, the novel model proposed in our work allows the use of resultsorienting from distance geometry such as Cayley-Menger Determinants which have not beenutilized previously in the context of social network research. Therefore, we believe that thiswork will open new avenues for social network research.The organization of the rest of this chapter is as follows. Section 3.2 defines the problemof interest identification and states the main assumptions of the work. Then, Section 3.3describes the proposed interest identification method and Section 3.4 provides a numericalexample to illustrate the important steps of the proposed method. Finally, Section 3.5summarizes the work presented in this chapter with some concluding remarks.3.2 Problem Definition and AssumptionsThe identified social network graph Gˆ = (Vˆ , Eˆ) (output of the MISC algorithm as describedin Section 3.1) can be used to estimate the interests of the set of newly identified nodesVˆm = Vˆ \Vk. We can define this problem of identifying interests of nodes as below.Problem (Interest Identification) Definition : given a social graph Gˆ = (Vˆ , Eˆ)where Vˆ = {Vk, Vˆm} is a partition of Vˆ , and a function h : Vˆ → Wˆ ⊆ Rd where only Vˆ andh(Vk) are known, find h(Vˆm).Our primary assumption in order to solve this problem is that the social network is amicro environment (a small network) which exhibits the Homophily behaviour. Homophily253.3. Interest Identification Using Cayley-Menger Determinantsprinciple states that there exists a connection between the social network of a person andhis/her interests. Simply put, this principle states that similarities between people causesconnections to be made between them (there is a tendency of nodes to be attached to othernodes that have similar characteristics) [27, 40]. In other words, Homophily behaviour statesthat “there exists a connection between how people are connected (the underlying spatialstructure) and their similarity of interests”. We represent this assumption as a function, f ,that maps some graph theoretic distance metric, g(vi, vj), between each pair of nodes, (vi, vj),to some value that is approximately equal to the Euclidean distance between interest vectorsof those nodes. Such distance metrics (domain of f) that are purely based on the structuralproperties of the graphs have been studied before for similar contexts in [14, 15]. Similarto our assumption, [18] specifically assumes a dependency between a set of feature vectorsassociated with the nodes and the network topology. This assumption can be mathematicallyformulated as:Main assumption: There exists a function g : Vˆ × Vˆ → B and a function f : B → R+such that B ⊆ R and f ◦ g : Vˆ × Vˆ → R+ is given byf ◦ g(vˆi, vˆj) = ‖h(vˆi)− h(vˆj)‖2 − εi,j, ∀(vˆi, vˆj) ∈ Vˆ × Vˆ (3.1)for some small εi,j.3.3 Interest Identification Using Cayley-MengerDeterminantsThe function g can be selected to be one of the widely used graph theoretic metrics such asGaussian distance, inverse squared shortest path, relative common friends [15] or some othermetric defined to suit the network in concern. Then, the function f ◦ g can be thought of asa noisy function which maps each pair of nodes to the squared Euclidean distance between263.3. Interest Identification Using Cayley-Menger Determinantsthe interests of those nodes. We can cluster the squared distances between interest vectorsof all the pairs of nodes (vˆi, vˆj) ∈ Vk × Vk that have the same metric g(vˆi, vˆj) as a setCg(vˆi,vˆj) = {d2ij|(vˆi, vˆj) ∈ g−1({g(vˆi, vˆj)}) ∩ Vk × Vk} (3.2)where,d2ij = ‖h(vˆi)− h(vˆj)‖2. (3.3)This would give |g(Vk × Vk)| number of distinct clusters of the Euclidean distances ofinterest vectors. Then, a discrete function fd : g(Vk × Vk)→ R+ can be defined asfd(g(vˆi, vˆj)) = C¯g(vˆi,vˆj), ∀vˆi, vˆj ∈ Vk (3.4)where, C¯g(vˆi,vˆj) is the arithmetic mean of the cluster Cg(vˆi,vˆj).This process can be repeated in a prior training stage to determine the best graph the-oretic metric, g, from a set of candidate metrics g1, g2, . . . , gm by minimizing a function ofthe summation of weighted variances of each cluster asg = argmingi∑Cgi(vˆi,vˆj)λgi(vˆi,vˆj) × var(Cgi(vˆi,vˆj)) (3.5)where λgi(vˆi,vˆj) is the weight assigned to the cluster corresponding to the value gi(vˆi, vˆj).The discrete data points of function fd, obtained by (3.4) can now be used to obtain acontinuous function f by fitting a polynomial of degree nopt. Then, the domain of f will be[sup(g(Vk × Vk)), inf(g(Vk × Vk))]. The optimal degree, nopt, can be determined by a modelorder determination method such as Akaike Information Criterion (AIC). We assume that theknown part of the network is large enough to make sure g(Vˆ × Vˆ ) ⊆ [sup(g(Vk × Vk)), inf(g(Vk × Vk))].Then, the function f ◦ g : Vˆ × Vˆ → R+ can be used to obtain a noisy estimate of thesquared Euclidean distance between any pair of nodes (vˆi, vˆj) ∈ Vˆ × Vˆ . The noise freedistance is known only for all (vˆi, vˆj) ∈ Vk × Vk.273.3. Interest Identification Using Cayley-Menger DeterminantsIn a d dimensional Euclidean space, we can calculate the coordinates of any unknownpoint, if the accurate distances from that point to d + 1 known points are available. Forexample, the coordinates of any point in a plane (d = 2) can be calculated by triangulationif the distance from that point to 3 known non-collinear points are available. Hence, toestimate the interest vector h(vˆi) of vˆi ∈ Vˆm, we need the distances between h(vˆi) and d+ 1known nodes. These distances can be obtained with some error using the previously obtainedcomposite function f ◦ g. However, the actual distances are not independent of each otherdue to the fact that they reside in a Euclidean space. [7] proposes a method which describesthe dependencies between the distances of points in a Euclidean space as a set of quadraticequality constraints. Since the actual distances are not independent, the errors that areassociated with the distances obtained using f ◦ g are also not independent. Hence, thesedependencies between the errors can be used to find the set of errors which would yield theactual distances as close as possible. Therefore, we find an elegant adaptation of the methodproposed in [7] (for sensor network localization) combined with results from [49] to suit ourpreviously described model and assumptions in order to find h(Vˆm).This method is based on the Cayley-Menger Determinants which are used to find thehypervolumes of simplexes. The volume, V (p0, . . . , pk), of a k-dimensional simplex withk + 1 points p0, . . . , pk can be expressed asV 2(p0, . . . , pk) =(−1)k+12k(k!)2CM(p0, . . . , pk) (3.6)where,CM(p0, . . . , pk) =∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣0 d201 . . . d20k 1d201 0 . . . d21k 1. . . . . . . . . . . . . . . . . . . .dk0 d2k1 . . . 0 11 1 . . . 1 0∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣(3.7)283.3. Interest Identification Using Cayley-Menger Determinantsand dij is the Euclidean distance between points pi and pj of the simplex. CM(p0, . . . , pk)is called the Cayley-Menger determinant of the points p0, . . . , pk [49].The following theorem is a popular result in distance geometry which gives the necessaryand sufficient conditions for the embeddability of a distance matrix in a Euclidean space.The full proof of it is given in [5] and an excellent description is also given in [49].Theorem 1 (Global Embedding Theorem). A distance matrix of h+1 points D(p0, p1, . . . , ph)is embeddable in a Euclidean space En of dimension n if and only ifi. There exists a submatrix R(p0, p1, . . . , pn) of n + 1 points of D so that, for every k =1, 2, . . . , n, the sign of CM(p0, . . . , pk) is equal to (−1)k+1;ii. For every pair pi, pj(i, j = n+ 1, ..., h)CM(p0, . . . , pn; pi) = 0 (3.8)CM(p0, . . . , pn; pj) = 0 (3.9)CM(p0, . . . , pn; pi, pj) = 0 (3.10)The first part of Theorem 1 states that there exists a reference object whose n-dimensionalvolume does not vanish if the set of points are embeddable in the Euclidean space of ndimension. In 2-dimensional space, this is equivalent to finding 3 points that are not co-linear so that the area of the triangle formed by the three points does not vanish. Similarly,in 3-dimensional space, this is equivalent to finding four points that are not co-planer so thatthe volume of the object formed by the formed by the four points does not vanish. We fulfillthis condition in our method by finding a set VR ⊆ Vk of d+ 1 nodes such that the interestvectors of the nodes in VR form a simplex in Rd whose d dimensional volume is not zero i.e.CM(h(VR)) 6= 0. Let the elements of Vˆ be re-labeled so that the indices from 1 to d+ 1 are293.3. Interest Identification Using Cayley-Menger Determinantsassigned to elements of VR, thus making VR = {vˆ1, vˆ2, . . . , vˆd+1}. We further define,d¯2ij = f ◦ g(vˆi, vˆj), ∀vˆi, vˆj ∈ Vˆ . (3.11)Then, we haved¯2ij = d2ij − εi,j (3.12)where εi,j is the error of the function f ◦ g. From the second part of Theorem 1, we knowthatCM(h(VR);h(vˆi)) = 0,∀vˆi /∈ VR. (3.13)Therefore, for any vˆi ∈ Vˆm, from equation 3.13 we get∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣0 d2i,1 . . . d2i,d+1 1d2i,1 0 . . . d21,d+1 1. . . . . . . . . . . . . . . . . . . . . . . . . . .di,d+1 d21,d+1 . . . 0 11 1 . . . 1 0∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣= 0. (3.14)The above expression can be decomposed as[d2i,1 . . . d2i,d+1 1]Ed2i,1...d2i,d+11= 0. (3.15)303.3. Interest Identification Using Cayley-Menger Determinantswhere,E =0 d21,2 . . . d21,d+1 1d21,2 0 . . . d22,d+1 1. . . . . . . . . . . . . . . . . . . . . . . . . . .d1,d+1 d22,d+1 . . . 0 11 1 . . . 1 0−1(3.16)as proven in [7]. The existence of the matrix E is guaranteed by our selection of nodes suchthat CM(h(VR)) 6= 0.Using (3.12) and (3.15), we get[d¯2i,1 + εi,1 . . . d¯2i,d+1 + εi,d+1 1]Ed¯2i,1 + εi,1...d¯2i,d+1 + εi,d+11= 0 (3.17)which defines a quadratic surface of the variables εi,j, j ≤ d+ 1 associated with vˆi ∈ Vˆm.This can be formulated as a least squares problem where we minimize the sum of errors,J = ε2i,1 + ε2i,2 + . . .+ ε2i,d+1, subject to the derived quadratic constraint obtained by (3.17)which is a well studied optimization problem that can be solved using methods such as mul-tiplier method [7]. Then, the resulting set of errors (ε2i,1, ε2i,2, . . . , ε2i,d+1) could be associatedwith the erroneous distances (d¯2i,1, d¯2i,2, . . . , d¯2i,d+1) obtained from the composite function f ◦ gto obtain the estimate of the true distances. These distances can then be used to easilycalculate the interest vector of the node in consideration using triangulation.It should be noted that further quadratic constraints can be obtained by selecting adifferent set of elements for VR, fulfilling the non-zero hypervolume condition. Then, theoptimization problem could be solved subject to multiple quadratic constraints and thusreducing the search space in the optimization problem and yielding a better accuracy. How-ever, this should be done depending on the available computational capacity. The interest313.4. Formulation of the Optimization Problem: A Numerical Examplevectors of all the missing nodes can be identified by repeating this process for all nodes inVˆm.3.4 Formulation of the Optimization Problem: ANumerical ExampleConsider a missing node vˆi ∈ Vˆm of the identified social network Gˆ where the interests of thenodes lie in the 2-dimensional Euclidean space (d = 2). Assume that our reference nodesVR = {v1, v2, v3 ∈ Vk} have the interests h(v1) =[0 0]T, h(v2) =[1 8]T, h(v3) =[2 3]Tand let the outputs of the composite function in equation (3.1) be d¯2i1 = 16, d¯2i2 = 25, d¯2i3 = 9(noisy squared distances between the missing node and the reference nodes).By substituting h(v1), h(v2) and h(v3) in (3.6), we can see that V2(v1, v2, v3) = 42.25,which fulfills the non-zero hypervolume condition of the part (i) of theorem 1 (can be seenfrom Fig. 3.1 as well). Using (3), we can obtain d21,2 = 65, d21,3 = 13 and d22,3 = 26 and then,substituting in (3.16) and (3.17), we can obtain the constraint of the optimization problemas:2ε2i,1 + ε2i,2 + 5ε2i,3 − 44εi,1 − 32εi,2 + 24εi,3 + 2εi,1εi,2 − 6εi,1εi,3 − 4εi,2εi,3 − 176 = 0. (3.18)Then, the errors can be calculated by minimizing J = ε2i,1 + ε2i,2 + ε2i,3 subject to (3.18).Solving optimization problems of this form has been extensively studied [22, 55] and canbe done using many methods. Further, [7] also arrives at a similar optimization problemin context of sensor networks where they use the Lagrangian multiplier method as the firststep.The calculated errors can now be associated with the noisy distances to obtain the actualdistances according to (3.12). This gives us three distances, di1, di2 and di3, to the unknowninterest vector from the three known vectors, h(v1), h(v2) and h(v3). Hence, the interest of323.5. Summary-5 0 5X-Axis02468Y-AxisFigure 3.1: Illustration of the locations of the reference nodes (black dots) and triangulationprocessthe missing node, h(vi), can be calculated easily using triangulation as shown in Fig 3.1.3.5 SummaryThis chapter proposed a solution to the problem of interest identification in social networks.Only the partial graph and the interests of a subset of nodes are known, and thus makingthis problem non-trivial to solve.Previous research has shown that there exists a relation between some graph theoreticdistance metric and the difference of interests (Euclidean distance) of each pair of nodesdue to the Homophily behaviour of social networks. This idea was utilized to obtain acomposite function which gives the noisy interest distances from d + 1 known nodes (in ad dimensional space) to a missing node. Then, equality constraints derived using Cayley-Menger Determinants were used to formulate an optimization problem in order to minimizenoise and find the d+ 1 distances to the missing node which can then be used to triangulate333.5. Summarythe interest associated with the missing node.34Chapter 4Interest Influencing in Online SocialNetworks4.1 Interest Influencing: Problem Definition andAssumptionsConsider the system model described in Section 1.1.2 again. In this chapter, we assumethat G = Gk (the network structure is known). As we pointed out in the Section 1.1.3, aconnection between two persons can prevent a change of interests. We model this behaviourby assuming that a connection (an edge in the graph) in a social network fixes the distancebetween the interests of the two nodes involved in that connection. In other words, we assumethat the interest distance between the nodes that are connected by an edge is not changingfurther. This assumption is particularly true in networks that are based on professional orresearch interests. In such networks, the similarity of interests (the fixed distance betweeninterest vectors) between two nodes is what makes an edge exist in the graph between them.Hence, the interest distance should remain same for that edge to exist. This assumption canbe mathematically formulated as:Assumption: If 〈vˆi, vˆj〉 ∈ E, thend‖h(vˆi)− h(vˆj)‖dt= 0. (4.1)354.1. Interest Influencing: Problem Definition and AssumptionsThis assumption introduces distance constraints which fix the distances between the pairsof interests vectors associated with adjacent nodes. Therefore, the triple G(h) = (V,E, h)can be viewed as a bar and joint framework where G = (V,E) is the social graph andh : V → Rd is the function that maps each node in the social network to its interest. Movingsuch a framework in the Euclidean space that it lives does not change the distances betweenthe pairs of interest vectors that correspond to adjacent nodes. However, this does notnecessarily ensure that the distances between non-adjacent vertices of the framework G(h)will remain fixed. Hence, we are interested in finding a method to ensure that all interestvectors (whether they are adjacent or not) keep their inter-distances fixed. Based on theprevious assumption, we can formally state this problem as:Problem (Interest Influencing) Definition : Given a bar and joint frameworkG(h) = (V,E, h), find En ⊆ V × V so that Gn(h) = (V,E ∪ En, h) is a rigid framework.The term rigid framework in problem 2 definition refers to a framework where everymotion preserves the distances between all pairs of vertices (flexible otherwise). Rigiditytheory is the branch in mathematics which studies the rigidity and flexibility of bar andjoint frameworks in Euclidean spaces [23, 45, 53]. As a simple example, a triangle formedby three rods and three joints is rigid in 2 dimensional space while a rectangle formed byfour rods and four joints is flexible in 2 dimensional space since one rod of the rectanglecan be moved while keeping another one fixed. The rectangle can be made rigid by addinga diagonal rod which will eliminate the additional degree of freedom and thus making theframework rigid. The rigidity theory develops a formal theoretical treatment for analysis ofrigidity of bar and joint frameworks. It has been shown that the rigidity of frameworks in2 dimensional Euclidean space can be analyzed purely by the properties of the underlyinggraph (neglecting the coordinates of joints). The concepts from rigidity theory have beenutilized previously for sensor network localization purposes in [3, 13, 47]. In our work, weuse rigidity theory to solve the second problem where the ultimate objective is to create364.2. Interest Influencing Using Rigidity Theoryrigid graph in 2 dimensional space. For example, if the interests of the customers form sucha rigid framework, we can fix the interests of all nodes by fixing the interests of few loyalcustomers.The organization of the rest of this chapter is as follows. Section 4.2 describes theproposed method of solving the problem of interest influencing using the concept of rigiditytheory. Section 4.3 provides numerical examples for the proposed method. Finally, Section4.4 provides a summary of the work presented in this chapter and some concluding remarks.4.2 Interest Influencing Using Rigidity TheoryFinding if a framework G(h) = (V,E, h) is rigid in d dimensional space involves a solving setof |E| quadratic equations with d|V | unknowns that fixes the distances of the set of edges ofthe framework,(hi(t)− hj(t))T (hi(t)− hj(t)) = cij, ∀〈vˆi, vˆj〉 ∈ E (4.2)where we have denoted h(vi) by hi. The framework is said to be to rigid, if all solutionsto the system of quadratic equations preserve the distances between all points irrespectiveof whether they are adjacent or not. Since finding a solution to this quadratic systemof equations is hard, the common approach is to linearize the problem by obtaining firstderivatives (which simplifies the problem to finding the rank of a matrix). However, in thecontext of social networks, computation of the rank of such matrices may not be practicaldue to the large size.As a solution to this, we can use Laman’s theorem [35] which completely characterizesminimally rigid (not rigid after removing one of its edges) graphs in two dimensional space.Theorem 2 (Laman’s Theorem). Let a graph G have exactly 2|V |−3 graph edges(|V | > 1).Then, G is rigid in R2 if and only if no subgraph G′ = (V ′, E ′) has more than 2|V ′|−3 edges(|V ′| > 1)374.3. Assessing Rigidity in 2-Dimensional Space: Numerical ExamplesA necessary (but not sufficient) condition for a graph to be rigid is to have at least 2|V |−3edges [50]. However, according to Laman’s theorem, these edges have to be well distributedin order to make a rigid graph. Hence, Laman’s theorem gives a purely combinatorial methodto analyze the rigidity of frameworks in 2 dimensions based only on the underlying graph(without considering the coordinates of vertices). In other words, rigidity of a frameworkG(h) = (V,E, h) in 2 dimensional space depends only onG = (V,E). The readers encouragedto refer chapter 2 of [50] for an excellent description about Laman’s theorem.However, such a combinatorial type of a theorem to assess the rigidity of frameworks indimensions of three and beyond is currently not available. Hence, the graph theoretic analysisof rigidity of frameworks in dimensions beyond two dimensions is not well understood andnot very useful from a computational point of view due to the lack of polynomial timealgorithms [50]. Hence, if we assume the number of parameters associated with nodes to betwo (d = 2), the Laman’s theorem and various combinatorial polynomial time algorithmsthat are based on it such as the pebble game algorithm [28] can be utilized to identify missingedges required to create a rigid network based on our system model and the assumption,thus giving the solution to the problem of interest influencing.4.3 Assessing Rigidity in 2-Dimensional Space:Numerical ExamplesHere, we provide 3 simple examples to illustrate how we can assess the rigidity of frameworksbased on the Laman’s theorem in a purely combinatorial manner.Assume that our social network has 5 nodes and their interests lie in the two dimensionalspace as represented in Fig.4.1. In the case of Fig.4.1(a), it can be seen that the numberof edges in the graph (6) is less than 2|V | − 3 = 7. Hence, the first graph does not satisfya necessary condition for rigidity and hence does not form a rigid structure. The structurein Fig. 4.1(b) is formed by adding an additional edge to the structure in Fig. 4.1(a) and384.3. Assessing Rigidity in 2-Dimensional Space: Numerical Examples(a) (b) (c)Figure 4.1: (a) Flexible graph (b) A minimally rigid graph (c) Redundantly rigid graph intwo dimensional spaceit can be seen that it satisfies the 2|V | − 3 edge count. Further, no subgraph G′ = (V ′, E ′)of it has more than 2|V ′| − 3 edges. Therefore, the structure in Fig. 4.1(b) is minimallyrigid. The structure in 4.1(c) has 8 > 2|V | − 3 = 7 edges and hence satisfy the constraintfor the minimal number of edges. If we remove the diagonal edge of the rectangle, then thesubgraph formed by the rectangle has only 4 edges and hence it will make the structurenon-rigid. However, if we remove the top edge of the lower triangle, then the graph willsatisfy the minimal rigidity conditions and will be a rigid graph.In the previous examples, we prefer to have the structures Fig.4.1 (b) or (c) in our socialnetwork since both of them are rigid. If our social network is of the form Fig.4.1(a), we wouldlike to introduce the two nodes in order to add the diagonal edge which makes the structurerigid. The pebble game algorithm proposed in [28] is a method to do the same analysiswe did above by assigning pebbles to a selected set of vertices initially. It is a polynomialtime algorithm which is based on the Laman’s theorem. In practice, such polynomial timealgorithms can be utilized to do the same analysis we did above, especially when the network394.4. Summaryis of large size.4.4 SummaryThis chapter proposed a solution to the problem of interest influencing in online socialnetworks. The problem was first formulated as a classic problem in rigidity theory. Then,the Laman’s theorem which completely characterizes a minimally rigid graph in 2 dimensionalspace was utilized to find the edges that should be added to make a rigid graph based onthe assumption that the interests of the nodes are 2 dimensional vectors. This assumptionwas required due to lack of a Laman’s type of a theorem to analyze rigidity of frameworksusing only the combinatorial properties of the underlying graph.40Chapter 5Conclusions and Future WorkThis thesis explored three important problems related to the online social networks utilizingtools from statistical signal processing and applied mathematics.The second chapter presented a comparison of the maximum likelihood estimates of thenetwork degree distributions from the sampled network degree distributions for two cases:1) the underlying network has an exponential degree distribution (2) the underlying net-work has a power-law degree distribution. Assuming that the noise introduced from sam-pling/measurement process follows a zero mean Gaussian distribution, maximum likelihoodestimates were derived for both models and their variances were analyzed using the Crame`r-Rao Bounds and mean squared errors. Using the results of these theoretical and numericalcomparisons, it was concluded that the maximum likelihood estimation has a significantlysmaller variance (approximately an order of magnitude) for exponential model compared tothe power-law model in the respective regions of interest. The reason for this interestingbehaviour was then conjectured to be the Second Order Stochastic Dominance of probabil-ity distributions. This was illustrated by considering a situation where the two distributionshave the same mean.Devising a formal proof for the presented conjecture involving the Second Order Stochas-tic Dominance would be an interesting research direction for this work. Another interestingresearch direction of the presented work would be to compare the model based estimationof the degree distributions for different random graph models.The third chapter proposed a solution to the problem of interest identification in socialnetworks. The main reason for this problem to be non-trivial is the fact that only the41Chapter 5. Conclusions and Future Workpartial graph and the interests of a subset of nodes are known. Previous research has shownthat there exists a relation between some graph theoretic distance metric and the differenceof interests (Euclidean distance) of each pair of nodes due to the Homophily behaviour ofsocial networks. This idea was utilized to obtain a composite function which gives the noisyinterest distances from d+1 known nodes (in a d dimensional space) to a missing node. Then,equality constraints derived using Cayley-Menger Determinants were used to formulate anoptimization problem in order to minimize noise and find the d+ 1 distances to the missingnode which can then be used to triangulate the interest associated with the missing node.An interesting future research direction of the presented work would be to extend theproposed framework to random environments where the parameters associated with nodesare random vectors and/or the underlying spatial structure is a random graph.The fourth chapter considered the problem of interest influencing in an online socialnetwork i.e. how to create a network structure in order to gain a maximum influentialpower over the interests of people in the network? This problem was formulated as a classicproblem in rigidity theory assuming the contagion behaviour of individuals in the network.This reduces the problem of interest influencing to finding the edges that should be addedto the network to make it a minimally rigid graph. Then, the Laman’s theorem whichcompletely characterizes a minimally rigid graph in two dimensional space was utilized tofind the edges that should be added to make a rigid graph based on the assumption thatthe interests of the nodes are two dimensional vectors. This assumption was required dueto lack of a Laman’s type of a theorem to analyze rigidity of frameworks using only thecombinatorial properties of the underlying graph.In future work, we intend to utilize real world data to further explore the theoreticalframework proposed in this work and also to extend this work to random environments.Further, it would be interesting to explore how the rigidity theory concepts of body-barframeworks (instead of bar and joint frameworks) can be related to social networks.42Bibliography[1] Facebook, Inc. (2016, Oct. 10). Company information: Stats [Online]. Available:https://newsroom.fb.com/company-info/[2] Twitter, Inc. (2016, Jun. 30). Twitter usage [Online]. Available: https://about.twitter.com/company[3] J. Aspnes, T. Eren, D. K. Goldenberg, A. S. Morse, W. Whiteley, Y. R. Yang, B. An-derson, and P. N. Belhumeur, “A theory of network localization,” IEEE Transactionson Mobile Computing, vol. 5, no. 12, pp. 1663–1678, Dec. 2006.[4] H. Bauke, “Parameter estimation for power-law distributions by maximum likelihoodmethods,” The European Physical Journal B, vol. 58, no. 2, pp. 167–173, Jul. 2007.[5] L. M. Blumenthal, Theory and Applications of Distance Geometry. Chelsea House Pub,1970.[6] P. E. Caines, Linear Stochastic Systems. John Wiley & Sons, 1988.[7] M. Cao, B. D. O. Anderson, and A. S. Morse, “Sensor network localization withimprecise distances,” Systems & Control Letters, vol. 55, no. 11, pp. 887–893, Nov.2006.[8] K. Choroman´ski, M. Matuszak, and J. Mikisz, “Scale-free graph with preferentialattachment and evolving internal vertex structure,” Journal of Statistical Physics,vol. 151, no. 6, pp. 1175–1183, Jun. 2013.43Bibliography[9] N. A. Christakis and J. H. Fowler, “Social contagion theory: examining dynamic socialnetworks and human behavior,” Statistics in medicine, vol. 32, no. 4, pp. 556–577 ,Feb. 2013.[10] A. Clauset, C. R. Shalizi, and M. E. Newman, “Power-law distributions in empiricaldata,” SIAM review, vol. 51, no. 4, pp. 661–703, Nov. 2009.[11] H. Daneshmand, M. Gomez-Rodriguez, L. Song, and B. Schoelkopf, “Estimatingdiffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm,” in Proceedings of the International Conference on MachineLearning, ICML’14, 2014, pp. 793-801.[12] W. Deng, W. Li, X. Cai, and Q. A. Wang, “The exponential degree distributionin complex networks: Non-equilibrium network theory, numerical simulation andempirical data,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 8,pp. 1481–1485, Apr. 2011.[13] T. Eren, O. Goldenberg, W. Whiteley, Y. Yang, A. Morse, B. Anderson, andP. Belhumeur, “Rigidity, computation, and randomization in network localization,” inTwenty-third Annual Joint Conference of the IEEE Computer and CommunicationsSocieties, INFOCOM’04, 2004, pp. 2673–2684.[14] R. Eyal, S. Kraus, and A. Rosenfeld, “Identifying missing node information in socialnetworks,” in Proceedings of the Twenty-Fifth Conference on Artificial Intelligence,AAAI’11, 2011, pp. 1166–1172.[15] R. Eyal, A. Rosenfeld, S. Sina, and S. Kraus, “Predicting and identifying missing nodeinformation in social networks,” ACM Transactions on Knowledge Discovery fromData, vol. 8, no. 3, pp. 1–35, Jun. 2013.[16] O. Frank, “Estimation of the number of vertices of different degrees in a graph,”Journal of Statistical Planning and Inference, vol. 4, no. 1, pp. 45–50, Jan. 1980.44Bibliography[17] O. Frank, “A survey of statistical methods for graph analysis,” Sociological methodology,vol. 12, pp. 110–155, Jan. 1981.[18] A. Freno, G. Garriga, and M. Keller, “Learning to recommend links using graphstructure and node content,” in Neural information processing systems workshop onchoice models and preference learning, 2011.[19] O. N. Gharehshiran, W. Hoiles, and V. Krishnamurthy, “Detection of homophiliccommunities and coordination of interacting meta-agents: A game-theoretic viewpoint,”IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 1,pp. 84–101, Mar. 2016.[20] M. Gjoka, M. Kurant, C. Butts, and A. Markopoulou, “Walking in facebook: A casestudy of unbiased sampling of osns,” in 29th Conference on Computer Communications,INFOCOM’10, 2010, pp. 1–9.[21] S. Goel and D. G. Goldstein, “Predicting Individual Behavior with Social Networks,”Marketing Science, vol. 33, no. 1, pp. 82–93, Oct. 2013.[22] G. H. Golub and U. Matt, “Quadratically constrained least squares and quadraticproblems,” Numerische Mathematik, vol. 59, no. 1, pp. 561–580, Dec. 1991.[23] J. E. Graver, B. Servatius, and H. Servatius, Combinatorial Rigidity. AmericanMathematical Society, 1993.[24] M. Hamdi, V. Krishnamurthy, and G. Yin, “Tracking a markov-modulated stationarydegree distribution of a dynamic random graph,” IEEE Transactions on InformationTheory, vol. 60, no. 10, pp. 6609–6625, Oct. 2014.[25] M. S. Handcock and K. J. Gile, “Modeling social networks from sampled data,” TheAnnals of Applied Statistics, vol. 4, no. 1, pp. 5–25, Oct. 2010.[26] E. J. Hannan and M. Deistler, The statistical theory of linear systems. SIAM, 1988.45Bibliography[27] M. O. Jackson, “Networks in the Understanding of Economic Behaviors,” Journal ofEconomic Perspectives, vol. 28, no. 4, pp. 3–22, Nov. 2014.[28] D. J. Jacobs and B. Hendrickson, “An algorithm for two-dimensional rigiditypercolation: The pebble game,” Journal of Computational Physics, vol. 137, no. 2, pp.346 – 365, Nov. 1997.[29] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory.Prentice-Hall, Inc., 1993.[30] U. A. Khan, S. Kar, and J. M. F. Moura, “Linear theory for self-localization:Convexity, barycentric coordinates, and cayley-menger determinants,” IEEE Access,vol. 3, pp. 1326–1339, Aug. 2015.[31] U. Khan, S. Kar, and J. Moura, “Distributed Sensor Localization in RandomEnvironments Using Minimal Number of Anchor Nodes,” IEEE Transactions on SignalProcessing, vol. 57, no. 5, pp. 2000–2016, May 2009.[32] M. Kim and J. Leskovec, “The network completion problem: Inferring missing nodesand edges in networks.” in Proceedings of the 2011 SIAM International Conference onData Mining, SDM’11, 2011, pp. 47–58.[33] V. Krishnamurthy, Partially Observed Markov Decision Processes. CambridgeUniversity Press, 2016.[34] V. Krishnamurthy, O. Gharehshiran, and M. Hamdi, Interactive Sensing and DecisionMaking in Social Networks, ser. Foundations and Trends in Signal Processing. NowPublishers, 2014.[35] G. Laman, “On graphs and rigidity of plane skeletal structures,” Journal of Engineeringmathematics, vol. 4, no. 4, pp. 331–340, Oct. 1970.46Bibliography[36] K. Lerman, X. Yan, and X. Z. Wu, “The “majority illusion” in social networks,” PloSone, vol. 11, no. 2, p. e0147617, Feb. 2016.[37] J. Leskovec and C. Faloutsos, “Sampling from large graphs,” in Proceedings of the 12thACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD’06, 2006, pp. 631–636.[38] K. Libert. (2016, Feb. 23). Your Network’s Structure MattersMore than Its Size [Online]. Available: https://hbr.org/2016/02/your-networks-structure-matters-more-than-its-size[39] L. Lovsz, “Random walks on graphs: A survey,” 1993.[40] M. McPherson, L. Smith-Lovin, and J. M. Cook, “Birds of a Feather: Homophily inSocial Networks,” Annual Review of Sociology, vol. 27, no. 1, pp. 415–444, Aug. 2001.[41] S. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, 2nd ed. CambridgeUniversity Press, 2009.[42] W. Ogryczak and A. Ruszczynski, “Dual stochastic dominance and related mean-riskmodels,” SIAM Journal on Optimization, vol. 13, no. 1, pp. 60–78, May 2002.[43] B. Ribeiro and D. Towsley, “Estimating and sampling graphs with multidimensionalrandom walks,” in Proceedings of the 10th ACM SIGCOMM conference on Internetmeasurement, 2010, pp. 390–403.[44] M. G. Rodriguez, D. Balduzzi, and B. Schlkopf, “Uncovering the temporal dynamicsof diffusion networks,” in Proceedings of the 28th International Conference on MachineLearning, ICML’11, 2011, pp. 561–568.[45] B. Servatius and H. Servatius, “Generic and abstract rigidity,” in Rigidity Theory andApplications. Springer, 2002, pp. 1–19.47Bibliography[46] C. R. Shalizi and A. C. Thomas, “Homophily and Contagion Are GenericallyConfounded in Observational Social Network Studies,” Sociological Methods &Research, vol. 40, no. 2, pp. 211–239, May 2011.[47] A. Singer and M. Cucuringu, “Uniqueness of low-rank matrix completion by rigiditytheory,” SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 4, pp.1621–1641, Feb. 2010.[48] M. J. Sippl and H. A. Scheraga, “Solution of the embedding problem and decompositionof symmetric matrices.” Proceedings of the National Academy of Sciences of the UnitedStates of America, vol. 82, no. 8, pp. 2197–2201, Apr. 1985.[49] M. J. Sippl and H. A. Scheraga, “Cayley-menger coordinates,” Proceedings of theNational Academy of Sciences, vol. 83, no. 8, pp. 2283–2287, Apr. 1986.[50] A. Sljoka, “Counting for Rigidity, Flexibility and extensions via the pebble gamealgorithm,” Ph.D. dissertation, York University, Toronto, 2006.[51] F. Thomas, E. Ottaviano, L. Ros, and M. Ceccarelli, “Performance analysis of a 3-2-1pose estimation device,” IEEE Transactions on Robotics, vol. 21, no. 3, pp. 288–297,Jun. 2005.[52] R. Van Der Hofstad. (2016, May 01). Random graphs and complex networks [Online].Available: http://www.win.tue.nl/rhofstad/NotesRGCN.pdf[53] W. Whiteley, “Rigidity of Molecular Structures: Generic and Geometric Analysis,” inRigidity Theory and Applications. Springer, 2002, pp. 21–46.[54] Y. Zhang, E. D. Kolaczyk, and B. D. Spencer, “Estimating network degree distributionsunder sampling: An inverse problem, with applications to monitoring social medianetworks,” Annals of Applied Statistics, vol. 9, no. 1, pp. 166–199, Mar. 2015.48[55] Z. Zhang and Y. Huang, “A projection method for least squares problems with aquadratic equality constraint,” SIAM Journal on Matrix Analysis and Applications,vol. 25, no. 1, pp. 188–212, Jan. 2003.49Appendix AMethods of Sampling Networks: ABrief SurveyIn this Appendix, we provide a brief survey of the literature about the widely used methodsfor network sampling. A detailed description of each method can be found in the referencesprovided.A.1 Sampling by Random Node SelectionA.1.1 Random Node SamplingThis is the most basic method that falls under the Random Node Selection category. As thename itself implies, this method works by selecting a set of nodes uniformly. Let the socialgraph (undirected) be G = (V,E) where V = {v1, v2, . . . , vN} and let the set of samplednodes be VS. We need |VS| ≤ Nmax as a constraint. Then the Random Node samplingalgorithm is as below.1. Generate X = Int(NU) + 1 where U ∼ Unif(0, 1) and N = |V |. (The Int(.) operatorextracts the integer part of a number)2. VS ←− VS ∪ vX3. If |VS| ≤ Nmax, go back to step 1.4. The sampled graph, GS is the graph induced by the set of nodes VS50A.1. Sampling by Random Node SelectionThe first step in the above algorithm transforms the uniform distribution in to a stepfunction (inverse transform approach). Many other variations of this method include meth-ods obtained by using weighted sampling instead of uniform sampling. One such exampleis the Random Degree Node (RDN) sampling where probability of a node being selected isproportional to its degree. However, this method will have a very high bias towards nodeswith higher degree [37]. Hence, this would not be a good choice if we want to estimate thedegree distribution. Further, most of the social networks randomly sample the User IDs ofthe users in order to obtain random node samples. This is undesirable in the case whereUser IDs are sparsely populated (large number of valid user IDs are not assigned) [43]. Insuch cases, we have to use overly complicated versions of this method to reduce the large hitto miss ratio that may occur due to the unpopulated User IDs.Further, it has also been proven that this method does not retain many interestingcharacteristics that are unique to social networks such as the power law in degree distribution[37].A.1.2 Breadth-First Sampling (BFS)Breadth-First sampling is another sampling algorithm that falls in to the category of samplingby random node selection. BFS uses two queues The steps of BFS algorithm are as below(using the same model as before).1. Begin two empty queus: Sampled and Processed2. Select vi ∈ V (seed) uniformly at random and insert it into Sampled and store theneighbours of vi in Processed3. Move the earliest explored node (last node in Processed) to the Sampled and insert allits neighbours (that are not already inserted) to Processed.4. Repeat step 3 until length of Sampled reach Nmax51A.2. Sampling by Random Edge SelectionAccording to [20], BFS is biased towards high degree nodes and also no statistical prop-erties can be theoretically proven for this method.A.2 Sampling by Random Edge SelectionSimilar to the idea of sampling by random node selection, the idea behind sampling byrandom edge selection is to select edges uniformly at random and then obtain the subgraphcreated by that set of edges as the sampled graph.However, a problem associated with this method in the context of social network samplingis that the sampled graph diameter will be extremely large (which does not agree with thesocial network community structure.). Since high degree nodes have more edges attachedto them, this method is also biased towards the high degree nodes. A variation of RandomEdge Selection which was proposed to eliminate this bias is Random Node Edge Sampling.In Random Node Edge Sampling, a node is first selected uniformly at random and then anedge incident to that node is picked uniformly at random [37]. However, if the edges donot have IDs (which is the practical scenario in social networks), then it may be hard andresource intensive to sample edges.A.3 Sampling by ExplorationA.3.1 Random Walk SamplingAs pointed above, both random node and edge sampling have certain weaknesses that areundesirable in the context of social networks. An alternate, computationally less expensivemethod which also reduces most of the weaknesses in previous methods is to sample a networkby means of a random walk [37, 43]. The Random Walk Sampling is explained next.Assume the social network in concern is an undirected graph G = (V,E) where V ={v1, v2, . . . , vn}. Let Xn be the random variable which gives the index of the node that the52A.3. Sampling by Explorationrandom walker is on at the time step n. Then, clearly {Xn} is a discrete time Markov chainandP(Xn+1 = j|Xn = i) =1d(i), if(i, j) ∈ E0, otherwise(A.1)where d(i) denotes the degree of the node vi. Hence, it can be seen that the Markov chainis also Homogeneous. In Random Walk (RW) sampling, we initiate a random walker at anode uniformly selected at random. Then, at each time step k, the walker moves from itscurrent position vi to vj and the edge (vi, vj) is added to the sequence of edges that havealready been sampled.Under certain conditions, we can use RW sampling to create estimators of graph charac-teristics that are asymptotically unbiased (converge to the actual characteristic value whenthe sample size goes to infinity). This follows from an important property of random walkscalled detailed balance which is proven below (This is mentioned without a proof in [39]).Theorem 3. A random walk on a connected graph has a unique stationary distribution thatsatisfies the detailed balance.Proof. Assume the graph is connected. Therefore, the random walk is irreducible (sincethere is a path from each node to every other node). Then, by the law of large numbersfor Markov chains, it has a unique stationary distribution which we denote by pi. We mustshow,Pijpi(i) = Pjipi(j) ∀i, j. We proceed with proof by cases.Case 1: (i, j) /∈ EThen Pij = Pji = 0 and therefore, the detailed balance equation is satisfied.Case 2: (i, j) ∈ E53A.3. Sampling by ExplorationLet pi be the vector that satisfies the detailed balance. Then,Pijpi(i) = Pjipi(j) ∀i, jpi(i)d(i)=pi(j)d(j)= k where d(.) denotes the degree of the vertex.Further, ∑i∈χpi(i) = 1.Therefore,k =1∑vi∈V d(i).Therefore, the Markov chain has a unique stationary distribution that satisfies the de-tailed balance.Further, it should be noted that the Random Walk Sampling on connected, non-bipartitegraphs sample edges uniformly at random when it is at the stationary distribution. In otherwords, sampling by random edge selection is similar to the random walk sampling at thestationary distribution. This is mentioned without proof in [43] and we present it as a formalstatement with a proof.Theorem 4. A random walk on a connected non-bipartite graph sample edges uniformly atrandom as the step size reaches infinity.Proof. Assume that the random walk is on a connected, non-bipartite graph. Then, fromTheorem 3, it has a unique stationary distribution. Further, since the graph is non-bipartite,the random walk is aperiodic. Then, by the Convergence of Marginals, the marginal distri-bution over nodes will reach the stationary distribution as the step size goes to infinity.54A.3. Sampling by ExplorationAssume the Markov chain is in stationary distribution at step n. Then, the probabilityof the random walker being in the node vj is(P Tpi)j =d(j)∑vi∈V d(i)(A.2)from Theorem 3.Then, the probability of the edge sampled at step n+1 being (j, k), denoted by pie(j −→ k)ispie(j −→ k) = d(j)∑vi∈V d(i)× 1d(j)(A.3)=1∑vi∈V d(i)(A.4)=12m(A.5)where m is the number of the undirected edges in the graph. Hence, the distribution overthe edges is uniform which concludes the proof.Next, with the aid of the above two theorems and a corollary of Law of Large Numbersfor Markov Chains which was stated in [41], we show how Random Walk sampling can beused to create asymptotically unbiased estimators of various graph characteristics (which isone of the advantages of Random Walk sampling over the other discussed methods).The following is a simpler version of the Theorem 17.2.1 of [41] (which is another versionof the law of large numbers) with a simpler proof (than the original one given in [41]) whichwe will use to create unbiased estimators using random walk sampling.Theorem 5. For a random walk {Xn}, on a connected graph,∑Ni=1 f(Xi)∑Ni=1 g(Xi)a.s.−→ Epi[f(X)]Epi[g(X)](A.6)55A.3. Sampling by Explorationwhere Epi(.) denotes the expectation with respect to the probability measure pi (stationarydistribution)Proof. Proof follows by noting that this is simply division of the terms of the law of largenumbers for Markov chains.[43] provides an extension of the above theorem as below with a full proof (which followstrivially from Theorem 5).Theorem 6. Let E∗ ⊆ E be nonempty. Let (ui, vi) be the ith sampled edge such that(ui, vi) ∈ E∗; and let B∗(B) be the number of such samples, where B is the number of totalrandom walk steps. Then, for any function f , where∑(u,v)∈E∗ |f(u, v)| ≤ ∞,limB→∞1B∗(B)B∗(B)∑i=1f(ui, vi)a.s.−→ 1|E∗|∑∀(u,v)∈E∗f(u, v) (A.7)It can be seen that the Theorem 6 (which is an extension of Theorem 5) can be utilizedto create asymptotically unbiased estimators. [43] also provides examples of four such esti-mators that estimates graph characteristics such as edge label density and global clusteringcoefficient by selecting the function f to suit the quantity that we are trying to estimate.Hence, it can be seen that sampling graphs (especially social networks) using randomwalk based methods has more advantages compared to the other two sampling methods interms of bias of the estimate and complexity.A.3.2 Frontier SamplingA recent advance in previously discussed Random Walk based methods in graph samplingis to use multiple dependent random walkers. This method, known as Frontier Sampling[43], can overcome the disadvantages of random walk sampling in cases where there are largenumber of disconnected components in the graph. In such cases, a single random walker56A.3. Sampling by Explorationmay get stuck in an isolated area and may provide estimates that represent only the saidisolated area, thus causing large estimation errors. Frontier sampling reduces this weaknessby employing multiple simultaneous random walkers to explore the graph, while retainingall the other nice properties of the random walk sampling methods [43].57