On uniform sampling of cliques

by

Massih Khorvash

B.Sc., The University of Lethbridge, 2006

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia (Vancouver)

October 2009

© Massih Khorvash 2009

Abstract

The problem that we address in this thesis is that of sampling uniformly the cliques of a target size k of any given graph. As a natural approach to this problem, we used already available state-of-the-art heuristic MAX-CLIQUE solvers. The heuristic MAX-CLIQUE algorithms that we used for this task have k-CLIQUE solvers as their subroutines. This thesis therefore examines how uniformly some of the state-of-the-art stochastic local search MAX-CLIQUE algorithms sample target cliques of graphs, and suggests various methods to improve their sampling performance. We also investigate the limitations of a generic method that uses existing SAT samplers (such as XORSample and XORSample′ [1]) to sample the solutions of SAT instances translated from k-CLIQUE instances. The main limitation of this method is the huge size of the encoded instances. Another important limitation is that sampling satisfying assignments of the encoded SAT instances is expensive. We found that some state-of-the-art heuristic MAX-CLIQUE algorithms (DLS-MC [2] and PLS [3]) are already able to sample near-uniformly the target cliques of many of the instances used in this thesis. Furthermore, we gained some insights into various properties of these algorithms, which helped us to improve the sampling performance on the remaining instances.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Background
  1.2 Measures of sampling uniformity
2 Related work
  2.1 Heuristic MAX-CLIQUE algorithms
  2.2 Near-uniform SAT samplers
  2.3 Near-uniform CSP samplers
3 Near-uniform sampling by translating k-CLIQUE to SAT
  3.1 An encoding of k-CLIQUE to SAT
  3.2 Two effective methods of sampling
    3.2.1 XORSample algorithm
    3.2.2 XORSample′ algorithm
  3.3 Empirical results
  3.4 Conclusion
4 Heuristic MAX-CLIQUE algorithms as samplers
  4.1 The DLS-MC algorithm
  4.2 Experimental setup
  4.3 Sampling performance of DLS-MC
  4.4 The PLS algorithm
  4.5 Sampling performance of PLS
  4.6 Conclusion
5 Improvements to DLS-MC
  5.1 Sampling uniformly and graph structures
    5.1.1 Testing Hypothesis 1 on DIMACS graphs
    5.1.2 Testing Hypothesis 1 using a family of small instances
    5.1.3 Not all cliques of size k − 1 have an effect on the amount of uniformity
    5.1.4 Modifying a DIMACS MAX-CLIQUE instance, p hat700-1, to test Hypothesis 1
  5.2 Sampling uniformly and pd
  5.3 Sampling uniformly and the use of a Monte-Carlo procedure integrated with DLS-MC
  5.4 Searching for other methods of improvement
  5.5 Graph structures with no useful information for sampling near-uniformly
  5.6 Conclusion
6 Conclusions and future work
Bibliography
Appendix

List of Tables

3.1 The size of encoded DIMACS MAX-CLIQUE graphs as SAT instances using the encoding of Section 3.1. The relatively small instances are shown in bold.
3.2 Sampling two of the encoded DIMACS MAX-CLIQUE graphs using XORSample.
3.3 Sampling the satisfying assignments of two of the DIMACS graphs encoded as SAT instances using XORSample′.
4.1 Sampling performance of DLS-MC with pd values taken from [2]. The instances that are sampled non-uniformly with respect to some measures of uniformity are shown in bold. The values for which the instances are considered as non-uniform are also shown in bold. Any number < 0.001 is denoted by ε. Each value after ± represents the standard deviation from the value of EMPN/E(n).
4.2 Sampling performance of PLS.
The instances that are sampled non-uniformly with respect to some measures of uniformity are shown in bold. The values for which the instances are considered as non-uniform are also shown in bold. Any number < 0.001 is denoted by ε. Each value after ± represents the standard deviation from the value of EMPN/E(n).
5.1 The result of sampling the solutions of SG with 60 hairs attached to Hairy using XORSample. The length of the XOR clauses is 40.
5.2 Sampling performance of DLS-MC with various pd values for the instance with 60 hairs from Subsection 5.1.2.
5.3 Sampling performance of DLS-MC with various pd values for the modified p hat700-1 instance with 7 new cliques of size 10 from Subsection 5.1.4.
5.4 Sampling performance of DLS-MC with the best pd value with respect to normalized entropy and EMPN/E(n), chosen from the pd values 1, 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50. The values in parentheses are the result of sampling the solutions taken from Table 4.1, with a pd value that minimizes the time required to sample each target clique.
5.5 Sampling performance of DLS-MC integrated with a Monte-Carlo procedure. The '# of selections' is the average number of selections reported in [2], required for DLS-MC with the pd values in [2] to find a target clique.
5.6 A matrix that represents the amount of connectivity among the solutions of brock200-3. Each row and column represent one of the cliques.
5.7 A matrix that represents the amount of connectivity among the solutions of brock200-4. Each row and column represent one of the cliques.
5.8 Sampling performance of the method that integrates a Monte-Carlo procedure with DLS-MC with temperature of 1 for the instance with 60 hairs from Subsection 5.1.2.
The best result of Table 5.2 with respect to EMPN/E(n) is shown within parentheses on the first row. The best result of Table 5.2 with respect to normalized entropy is shown within parentheses on the second row. Table 5.2 represents the sampling performance of DLS-MC with various pd values on the same instance.
5.9 Sampling performance of the method that integrates a Monte-Carlo procedure with DLS-MC for the modified p hat700-1 instance with 7 new cliques of size 10 from Subsection 5.1.4. The best result of Table 5.3 is shown within parentheses. Table 5.3 represents the sampling performance of DLS-MC with various pd values on the same instance.
5.10 The adjacency matrix of the graph of Figure 5.11 added to the identity matrix.
5.11 The adjacency matrix of the graph of Figure 5.11 raised to the power 3 (i.e. the target clique size).
1 The number of edges, vertices and target cliques, and the target clique size, of each DIMACS MAX-CLIQUE instance used in this thesis.

List of Figures

3.1 A graph with a maximum clique of size 3.
4.1 Cumulative probability distribution of relative entropies across different graphs induced by DLS-MC.
4.2 Cumulative probability distribution of EMPN/E(n) across different graphs induced by DLS-MC.
4.3 Cumulative probability distribution of 100 values of EMPN/E(n) induced by 100 experiments using DLS-MC for the instance brock200-4.
4.4 Cumulative probability distribution of 100 values of EMPN/E(n) induced by 100 experiments using DLS-MC for the instance p hat300-3.
5.1 Depiction of a graph with two maximum (target) cliques c1 and c2. Circles c1 and c2 represent the target cliques of the graph. Each circle represents a clique of size less than or equal to the cliques c1 and c2 that shares some vertices with them.
5.2 Each bar represents the number of cliques of size 20 of the instance brock200-1 for which their corresponding b1-differences fall within the range of numbers specified on the x axis. For details, see text.
5.3 Each bar represents the number of cliques of size 20 of the instance brock200-1 for which their corresponding r1-differences fall within the range of numbers specified on the x axis. For details, see text.
5.4 A graph from the family of graphs SG with 4 hairs connected to one of its cliques of size 3.
5.5 The effect of increasing the number of hairs connected to the vertices of Hairy on the amount of uniformity of the sampled target cliques. As the number of hairs increases, the normalized entropy decreases.
5.6 Impact of cliques of sizes smaller than the target size that are not connected to any of the target cliques on the amount of uniformity of sampled target cliques. There are 60 vertices connected to Hairy, and there are 200 vertices connected to the vertex with the label t.
5.7 Impact of increasing the number of cliques of sizes smaller than the target cliques connected to one of the target cliques on the amount of uniformity of the sampled solutions. The pink line shows the trend of changes in the relative frequency of the target clique as cliques of size 10 are being attached to it.
The grey line shows the corresponding changes in the relative frequency of the other clique of size 11 as cliques of size 10 are being attached to the other clique.
5.8 The effect of changing the value of pd on the amount of entropy of the sampled solutions for the instance brock200-1.
5.9 The effect of changing the value of pd on the amount of entropy of the sampled solutions for the instance brock400-4.
5.10 The effect of changing the value of pd on the amount of entropy of the sampled solutions for the instance brock200-3.
5.11 A graph with a target clique of size 3.
5.12 The graph of Figure 5.11 after two of its edges are pruned using the method proposed in Section 5.4.

Acknowledgements

It would be a long list to thank all the magnificent people who have been an inspiration and have encouraged me to continue my education. However, there are some who ought to be acknowledged for their direct influence during my Master of Science at the University of British Columbia. I owe my deepest gratitude to Dr. Holger Hoos and Dr. David Kirkpatrick, my research supervisors. I immensely enjoyed our long and enlightening discussions. I am indebted to Dr. Hoos for providing the idea for this project, and for his continuous close supervision and valuable insights. I also would like to show my appreciation to him for providing me with the code for the DLS-MC algorithm. It was an honor to work under the supervision of Dr. Kirkpatrick. His patience, tireless support and constant availability are much appreciated. Without their exceptional guidance the writing of this thesis would not have been completed. This experience will be a major asset for my future studies. It is also a pleasure to thank Dr. Raymond Ng for his general good-naturedness and for taking the time to be my second reader. A special thanks to Dr. Wayne Pullan for making the code for the PLS algorithm available.
I would like to thank Frank Hutter for proofreading many sections of my thesis. I also would like to thank my friends and colleagues at the Beta lab. They include Chris Thachuk, Lin Xu, Jonathan Backer, Dave Tompkins, Jeff Sember, Mohmmad Safari, Baharak Rastegari, Hosna Jabbari, Andrew Carbonetto, Mirela Andronescu and Chris Fawcett. I also would like to use this opportunity to thank Dr. Yllias Chali, Dr. Wolfgang Holzmann, Dr. Hadi Kharaghani and Dr. Stephen Wismath, who provided recommendation letters for my application to the University of British Columbia. I would like to extend my sincerest thanks to my friends Majid Safataj, Farhad Faradji, Mohmmad Abdul-Amir, Mahdi Tayarani, Ehsan Khajeh, Pooyan Fazli, Zeinab Abaasi, Mohmmad Ghazi-Asgar, Anseok Joo and Hao Ren for their continuous love and support. I am also greatly thankful to my family, Maman, Baba, Maryam, Zahra, Alireza and Charlene Janes, who always supported, encouraged and believed in me.

'all praise is due to our God, the lord of the world'

Dedication

To my mom who is greater than a beautiful sunrise and my dad who is greater than the ocean.

Chapter 1
Introduction

There are many problems concerning the interactions amongst the elements of a set, which can be naturally represented as a graph. A subset of these elements in which every pair of elements interacts is called a clique. The problem of deciding whether there is a clique of a specified size in a given graph is denoted by k-CLIQUE. The problem of k-CLIQUE is one of the early combinatorial problems shown to be NP-complete [4]. A variant of it, MAX-CLIQUE, is the problem of searching for the largest clique, which is NP-hard [5], and has many applications in areas such as information retrieval, experimental design, signal transmission, computer vision and bioinformatics [6–8].
Typical state-of-the-art heuristic MAX-CLIQUE solvers can be used to find cliques of any given size k of graphs quickly, where k is less than or equal to the largest known clique size for these graphs [2, 3, 9]; however, some applications need a variety of these target cliques, for instance, knowledge discovery in which an element may belong to more than one category (represented as a clique) [10], computer vision [11] and bioinformatics [12]. Perhaps the clearest example is in bioinformatics: discovering the interactions among proteins. Proteins are large organic compounds that are essential parts of organisms and serve different biological functions in every cellular process. Proteins can bind to other proteins to form protein complexes, and the proteins in a complex take part in many similar reactions within a biological function. One approach towards discovering the biological functions of a protein is to consider the known proteins that interact with it, since proteins that interact with each other perform a similar biological function [13]. The interactions of proteins in an organism can be represented as a protein-protein interaction (PPI) network, where proteins are represented by the nodes and an edge represents an interaction between two proteins. The verified PPI networks of several organisms (such as yeast) are available today. It is believed that these PPI networks have certain topological features. Some mathematical network generators that produce networks with topological features similar to those of the available PPI networks have been developed. Two such models are duplication and preferential attachment. Both of these models start with an initial graph called a seed and grow the network in a sequence of steps [12]. Hormozdiari et al.
showed that several key topological features (degree distribution, k-hop reachability, graphlet, betweenness and closeness distributions) of these PPI networks depend heavily on the specific model and the seed graph used. The 'right' seed graph is typically a dense subgraph of the analyzed PPI network [12]. For example, Hormozdiari et al. picked a maximum clique of size 10 of the yeast PPI network along with a clique of size 7 and added some edges between them to enrich the initial seed in order to produce a network similar to the yeast PPI network. Therefore, it is reasonable to assume that if we can uniformly sample the maximum cliques of an analyzed PPI network, we may be able to provide a variety of 'good' seeds to those models to generate PPI networks with different topological features. For such applications it would be best to use solvers designed to find almost-complete subgraphs (quasi-cliques) rather than maximum cliques; however, the methods proposed for sampling near-uniformly the maximum cliques of graphs are, as we will see, applicable to some of the algorithms for finding quasi-cliques, as these quasi-clique algorithms are extended from heuristic MAX-CLIQUE solvers [14]. Exact MAX-CLIQUE algorithms, for example the Bron-Kerbosch algorithm [15], are guaranteed to return a maximum clique, but their running time grows exponentially with the number of vertices of the graph. Unfortunately, MAX-CLIQUE is inapproximable in polynomial time within a factor of |V|^(1−ε), for any ε > 0, unless NP = ZPP [16]. The best polynomial-time approximation algorithm for MAX-CLIQUE achieves an approximation ratio of O(|V|/(log |V|)^2) [17]. Because of the computational complexity of exact algorithms and the fact that MAX-CLIQUE is hard to approximate, interest has shifted towards efficient heuristics for which no formal guarantee is provided, but which achieve good empirical results anyway [18]. Examples include DLS-MC [2], PLS [3] and an algorithm by Grosso et al. [9].
(Footnote: The k-hop reachability of a node is the number of distinct nodes it can reach via a path of at most k edges. Graphlets are small graphs such as triangles, stars, or cliques. The betweenness of a node v is the number of shortest paths between any pair of nodes u and w that pass through v, normalized by the total number of such paths. The closeness of v is the inverse of the total distance of v to all other nodes u.)

(Footnote: ZPP is the class of problems that can be solved in expected polynomial time by a probabilistic algorithm with zero error probability.)

As stated before, some applications may need more than one target clique of a graph. One may suggest finding all the target cliques of a given graph; however, there are some issues with this approach. The first problem is that there could be an exponential number of target cliques in the graph to enumerate. The other problem is that even if there is not an exponential number of target cliques to enumerate, this approach may still not be feasible: Valiant and Vazirani showed that, in the worst case, the number of solutions does not affect the difficulty of determining the existence of any solution, which means that in the worst case we may not even be able to partially enumerate the target cliques of a graph with many target cliques [19]. Of course, one may try to maximize the number of target cliques enumerated in a specified period of time, but the enumeration algorithms may not guarantee that the target cliques are generated uniformly at random. One way of reaching the goal of maximizing the number of target cliques of a given graph enumerated in a period of time is to efficiently sample the target cliques of the graph near-uniformly. The focus of this thesis is to study various methods for uniform sampling of the target cliques of graphs, and to suggest various ways to improve their degree of uniformity. The rest of this thesis is structured as follows.
In the remainder of this introductory chapter, we provide some background information on the problem of k-CLIQUE, and discuss the notion of uniformity and some ways of measuring it. In Chapter 2, a summary of the literature on counting, enumeration and sampling of combinatorial spaces, with a focus on MAX-CLIQUE, is provided. In Chapter 3, we first describe an encoding of k-CLIQUE into SAT. We then use an already known near-uniform SAT sampler to sample the satisfying assignments of the encoded SAT instances. Finally, the sampled satisfying assignments are translated back to cliques. We show that this is not a practical approach, mainly because of the huge size of the translated instances. The other reason for the impracticality of this method is that the samplers were not able to sample the satisfying assignments of the encoded instances as efficiently as a native MAX-CLIQUE algorithm, DLS-MC. Next, in Chapter 4, we use two MAX-CLIQUE algorithms, DLS-MC and PLS, that are capable of sampling near-uniformly many of the chosen instances. PLS is easier to use, as it requires no adjustment of parameters, but its sampling performance is not as good as that of DLS-MC. DLS-MC has a parameter that can have a significant impact on sampling performance. In Chapter 5, we then suggest different methods to improve the sampling performance of DLS-MC. Two of these methods are able to improve the uniformity of the sampled target cliques for the instances that were not sampled near-uniformly by DLS-MC. In Chapter 6, we summarize the main insights gained from our study, and outline some directions for future research.

1.1 Background

The formal definitions of the problems mentioned in this thesis are as follows.

Definition 1 k-CLIQUE:
Input: An undirected graph G = (V, E), where V is the set of all vertices and E the set of all edges, and an integer k: ⟨G, k⟩. Each edge connects exactly two different vertices.
Output: A yes or no answer to whether the graph G has a clique of size k.

A variant of the problem of k-CLIQUE produces a clique of size k as its output if the answer of k-CLIQUE is yes. We dub this problem constructive k-CLIQUE. As most of the algorithms for solving k-CLIQUE also provide a clique of size k as a certificate, in this thesis we use k-CLIQUE to refer to the problem of constructive k-CLIQUE.

Definition 2 Target clique: A clique of a specified size k is called a target clique.

Another problem related to the one we consider in this thesis is the problem of MAX-CLIQUE, which can be stated formally as follows.

Definition 3 MAX-CLIQUE:
Input: An undirected graph G = (V, E), where V is the set of all vertices and E the set of all edges: ⟨G⟩.
Output: The size k of the maximum-size cliques of G.

Constructive MAX-CLIQUE is the problem of MAX-CLIQUE where the output is one of the maximum-size cliques. We use MAX-CLIQUE in this thesis to refer to the problem of constructive MAX-CLIQUE, as is common in the literature [2, 3, 9, 18].

The problem that we address in this thesis is that of sampling uniformly the cliques of a given size k of any given graph; that is, we are not trying to optimize the value of k, but to maximize the variety of discovered cliques of size k at low cost. We dub this problem S-CLIQUE, read as 'sample clique'.

As the problem of MAX-CLIQUE is NP-hard [5] and is also hard to approximate [16, 17], interest has shifted towards efficient heuristics for which no formal guarantee is provided, but which achieve good empirical results anyway [18]. Some heuristic MAX-CLIQUE solvers use a k-CLIQUE solver as their subroutine. The value of k for the k-CLIQUE subroutine is increased incrementally to search for the largest possible cliques of graphs in a specified period of time.
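To make the definitions above concrete, here is a minimal brute-force sketch of constructive k-CLIQUE: given ⟨G, k⟩, it returns a clique of size k as a certificate if the answer is yes, and None otherwise. It runs in exponential time and is illustrative only; the function name is ours and is not taken from any solver discussed in this thesis.

```python
# Brute-force constructive k-CLIQUE: try every k-subset of vertices and
# check that all pairs are adjacent.  Exponential time; illustration only.
from itertools import combinations

def find_k_clique(vertices, edges, k):
    """Return a set of k pairwise-adjacent vertices, or None."""
    adjacent = {frozenset(e) for e in edges}
    for candidate in combinations(vertices, k):
        if all(frozenset(pair) in adjacent
               for pair in combinations(candidate, 2)):
            return set(candidate)  # the certificate for the 'yes' answer
    return None

# A triangle with one pendant vertex: the maximum clique size is 3.
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(find_k_clique(V, E, 3))  # {1, 2, 3}
print(find_k_clique(V, E, 4))  # None
```

A heuristic MAX-CLIQUE solver built on such a k-CLIQUE subroutine would, as described above, call it with incrementally larger k until no larger clique is found within the allotted time.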
Typical state-of-the-art heuristic MAX-CLIQUE solvers can therefore be used to find cliques of any given size k of graphs quickly, where k is less than or equal to the largest known clique size for these graphs [2, 3, 9]. As a natural approach to solving S-CLIQUE, we used the already available state-of-the-art heuristic MAX-CLIQUE solvers. We then suggest various ways of improving the uniformity of the distribution of sampled target cliques induced by multiple runs of these MAX-CLIQUE solvers. As the large cliques of graphs seem to be particularly useful in various applications [6–8], we study the uniformity among the sampled cliques of the largest known sizes. The values of k for many of the graphs used in this thesis correspond to the respective maximum clique sizes [2]. There is, however, no reason to believe that our methods depend on these particular values; they may be used for sampling cliques of various sizes k in the problem of k-CLIQUE.

1.2 Measures of sampling uniformity

The k-CLIQUE algorithms can be classified into two classes based on the number of solutions they generate in a run. A typical k-CLIQUE algorithm generates one target clique per run. A k-CLIQUE algorithm may also have the option to report more than one solution per run. The degree of uniformity of the sampled target cliques may be calculated differently for these two types of sampling procedures. The measures introduced in this section are for algorithms that report one target clique per run. The algorithms used in this thesis are designed to report one sampled solution at the end of each independent run, with the exception of the algorithm from Section 5.3. Multiple runs of a sampler induce a probability distribution. We will use p_i to denote the probability of sampling the ith target clique. This probability distribution on the space of target cliques is uniform if all of the solutions are found with equal probability.
The algorithm producing such a uniform distribution is called a uniform sampler. It also seems worth noting that, depending on the application, one might value a target clique more if it has small overlap with the previously found target cliques (or vice versa), which will affect the type of uniformity one looks for. In this thesis, we consider two target cliques different if they differ by even a single vertex.

A natural approach for sampling the target cliques of a graph is to use the already available state-of-the-art MAX-CLIQUE algorithms. A priori, there is no reason to expect even randomized versions of these algorithms to sample the target cliques of a given graph completely uniformly. Therefore, we need to choose some measures to assess their degree of uniformity. A widely used measure of uniformity is Shannon's entropy [20]. The entropy of a probability distribution over a set of n target cliques of a graph is defined as

−Σ_{i=1}^{n} p_i · log2(p_i)

where p_i is the probability of sampling the ith target clique. If the sampler does not sample some of the target cliques at all, then p_i · log2(p_i) is defined to be zero for them. Entropy can be seen as the degree of uncertainty of the outcomes of an experiment. The entropy of a fair coin is 1; however, a biased coin has a lower entropy, as the uncertainty between the two outcomes is decreased. The entropy of a fair die is 2.585, but as a die becomes biased towards some of its faces, its entropy decreases. Entropy is not informative if we do not know the number of possible outcomes. A descriptive example is as follows.

Example 1 If the entropy of an n-sided die is 5.807, the value of entropy alone does not give us any information about how biased the die is. If the number of faces of the die is 56, then an entropy of 5.807 means the die is a fair one; however, for a 100-sided die an entropy of 5.807 indicates a biased die.
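The entropy values quoted above can be verified with a small sketch; the helper below is ours, for illustration only, and a probability of zero contributes nothing to the sum, as stated in the text.

```python
# Shannon entropy of a discrete distribution, with 0 * log2(0) taken as 0.
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))             # fair coin: 1.0
print(round(entropy([1 / 6] * 6), 3))  # fair die: 2.585
print(round(entropy([0.9, 0.1]), 3))   # biased coin: well below 1
```

More generally, a fair n-sided die has entropy log2(n), which is the maximum over all distributions on n outcomes; this is the normalizing constant used in the next section.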
As it is not possible to determine how far away we are from the ideal uniform distribution by just looking at the value of entropy, we considered normalizing the value of entropy by dividing it by the maximum possible entropy of the distribution under consideration, which is the logarithm of the total number of possible outcomes. We define normalized entropy as

(−Σ_{i=1}^{n} p_i · log2(p_i)) / log2(n)

where n is the total number of target cliques of the instance and p_i is the probability of sampling the ith target clique. Note that the values of normalized entropy are in the range [0, 1]. Perhaps the best way to describe a weakness of normalized entropy is by some examples.

Example 2 The normalized entropy is still quite large in the case when there are 1 000 solutions, half of which are found by a solver with probability 1/500 and the rest with probability 0. The normalized entropy of this outcome is

−(Σ_{i=1}^{500} (1/500) · log2(1/500) + Σ_{i=501}^{1000} 0) / log2(1 000) = 0.899

We can have a different probability distribution with the same normalized entropy as in the above case (0.899). Let the number of solutions be 1 000 again. If 770 of the solutions are sampled with probability 1/3600 and the rest with probability 1/292.579, the normalized entropy is again 0.899.

The measure of normalized entropy cannot distinguish between the two distributions given in Example 2. Because normalized entropy fails to reflect the existence of outlier solutions, we use another measure, dubbed FRC(w), which measures the fraction of sampling probabilities that fall in the interval [1/(w·n), w/n]. Using this measure we can bound the sampling probabilities. We choose the value of w such that the difference between the maximum sampling probability and the minimum sampling probability is at most 1/10. The latter condition is satisfied if the value of w is (n + √(n² + 400n))/20. Let INT10 denote FRC(w) with this value of w. INT10 is able to distinguish between the two distributions in Example 2. However, it is possible to have probability distributions that differ with respect to other measures of uniformity but have the same INT10 value. Therefore, normalized entropy and INT10 are complementary uniformity measures.

We also use another measure, which is the number of trials needed until the k-CLIQUE sampler finds each target clique of the given graph at least once, normalized by dividing by the corresponding expected number of trials needed by an ideal uniform sampler. Theoretically, an ideal uniform sampler takes E(n) = n·H_n trials to sample each of the target cliques at least once, where H_n is the nth harmonic number [21]. Let EMPN be the empirical variant of E(n). The value of EMPN is evaluated as the average number of runs that the sampler requires to sample each solution at least once, over 100 independent experiments. We use EMPN/E(n) as a measure of uniformity. The value of EMPN/E(n) can vary from a number close to zero to a very large number. The expected number of steps required for a uniform sampler, E(n), to sample each target clique of a graph depends on the number of target cliques of the graph, n. However, when EMPN/E(n) is 1, the ratio between the empirical number of steps the sampler needs to report each target clique at least once, EMPN, and the theoretical value for a uniform sampler, E(n), is independent of the number of target cliques. It is important to note that EMPN/E(n) also is not by itself a perfect measure of uniformity. For example, an algorithm that samples the target cliques in a fixed sequence repetitively one after the other is not sampling the solutions uniformly at random, but its value of EMPN is n, where n is the total number of solutions.
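The three measures just described can be sketched as follows, assuming the p_i are given as relative sampling frequencies; the function names are ours, INT10 uses the value of w stated in the text, and the two distributions are those of Example 2.

```python
# The three uniformity measures: normalized entropy, INT10 = FRC(w) with
# the stated w, and the coupon-collector baseline E(n) = n * H_n.
from math import log2, sqrt

def normalized_entropy(probs):
    """Entropy divided by its maximum, log2(n); lies in [0, 1]."""
    n = len(probs)
    return -sum(p * log2(p) for p in probs if p > 0) / log2(n)

def int10(probs):
    """FRC(w): fraction of probabilities inside [1/(w*n), w/n],
    with w = (n + sqrt(n^2 + 400n)) / 20 as given in the text."""
    n = len(probs)
    w = (n + sqrt(n * n + 400 * n)) / 20
    return sum(1 for p in probs if 1 / (w * n) <= p <= w / n) / n

def expected_runs(n):
    """E(n) = n * H_n: expected runs for an ideal uniform sampler to
    report each of the n target cliques at least once."""
    return n * sum(1 / i for i in range(1, n + 1))

# The two distributions of Example 2: the same normalized entropy (~0.9)...
a = [1 / 500] * 500 + [0] * 500
b = [1 / 3600] * 770 + [1 / 292.579] * 230
print(normalized_entropy(a), normalized_entropy(b))
# ...but INT10 separates them: the 500 zero-probability outliers in `a`
# fall outside the interval.
print(int10(a), int10(b))  # 0.5 1.0
```

The fixed-sequence sampler mentioned above would reach EMPN = n, while expected_runs(n) grows as n·H_n, which is why its ratio EMPN/E(n) falls well below 1.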
In this example, the value of normalized entropy is close to 1, but the value of EMPN/E(n) is less than 1. We measure uniformity only over those solutions that are discovered in 100 000 runs of the sampler, as we do not know the actual number of solutions of each instance. The value of n is the number of distinct solutions sampled in the 100 000 runs. The sampling probability p_i is consequently the relative frequency with which the i-th target clique was sampled. As each of these measurements captures different aspects of the induced distribution, we use all of them (normalized entropy, INT10, and EMPN/E(n)).

Chapter 2: Related work

A natural approach for solving the problem of S-CLIQUE would be to use the already available MAX-CLIQUE samplers, as typical state-of-the-art heuristic MAX-CLIQUE solvers can quickly find cliques of any given size k of graphs, where k is less than or equal to the largest known clique size for these graphs [2, 3, 9]. While much attention has been paid to the problem of MAX-CLIQUE [2, 3, 9, 18], to the best of our knowledge the problems of counting, enumerating and sampling uniformly the target cliques of a graph have not been previously addressed. The problems of counting, enumerating and sampling, as we will see, are related to each other. An obvious way of sampling target cliques of a graph is to enumerate them first and sample uniformly at random from the set of enumerated solutions. As mentioned in Chapter 1, there are two problems with this approach: the possibly exponential number of solutions and the difficulty of finding even one target clique. Another approach for sampling is to use an already available counting procedure [1].
The problem of finding all of the target cliques of a graph is at least as hard as the problem of counting its target cliques (#MAX-CLIQUE), since we may be able to count the total number of solutions but not be able to enumerate them efficiently, particularly if the number of solutions is exponential. On the other hand, if we can enumerate the solutions, we can count them without significant additional effort. #MAX-CLIQUE is #P-complete and is at least as hard as its associated NP-hard problem (MAX-CLIQUE), because if we could solve #MAX-CLIQUE, we obviously would be able to solve MAX-CLIQUE [22]. The main idea of sampling using an already available counting procedure is to restrict the search space iteratively until the search space is small enough to count its solutions. At this point, if there is a small number of solutions left, they are enumerated and one of them is sampled uniformly at random [1]. Obviously, if it is hard to find a solution of the instance, reducing its number of solutions does not make the enumeration of its solutions easy [19]. Many of the exact counting methods for #P-complete problems, including Relsat [23] and Cachet [24], do their search on the raw combinatorial search space; therefore, these methods are usually not efficient when the size of the instances scales up [25]. It is also possible to approximate the number of solutions using a near-uniform sampler. A method of approximating the number of satisfying assignments of a SAT formula using a near-uniform sampler, by Jerrum et al. [26], is as follows. Consider a Boolean formula F with N satisfying assignments such that we are able to sample its satisfying assignments uniformly at random. We can partition the set of sampled satisfying assignments based on the truth value of a variable x1.
From this partitioning we can compute the number of sampled satisfying assignments that have x1 set to true, M+, divided by the total number of samples, M. Let γ be M+/M (1/γ is called the multiplier). The value of γ converges to the true fraction of satisfying assignments with x1 set to true, which means we can approximate the number of satisfying assignments of F as (1/γ) times the number of satisfying assignments of the simpler formula F+ obtained from F by setting x1 to true. In this way, the problem of counting the satisfying assignments of F is reduced to counting those of F+. This simplification can be performed recursively until either all of the variables are exhausted or the formula is countable by a state-of-the-art exact counter [26]. Any solver for the problem of S-CLIQUE must essentially find target cliques of a given graph, similar to what any heuristic MAX-CLIQUE solver would do. In Section 2.1, we provide a comprehensive review of state-of-the-art heuristic MAX-CLIQUE algorithms. In contrast to the problem of k-CLIQUE, for the problems of SAT and constraint satisfaction (CSP) some sampling procedures are available [1, 27–31]. There are various ways of translating the problems of k-CLIQUE, SAT and CSP to each other [32–37]. We may be able to sample target cliques of a graph near-uniformly by first translating the problem of k-CLIQUE parsimoniously to SAT or CSP, then sampling the solutions with a SAT or CSP near-uniform sampler, and finally translating the solutions back to cliques. In Sections 2.2 and 2.3 we therefore review some of the sampling procedures available for the problems of SAT and CSP.

2.1 Heuristic MAX-CLIQUE algorithms

This section provides some comparisons amongst many of the state-of-the-art heuristic MAX-CLIQUE algorithms. One of the common methods of searching for target cliques of a graph is local search. A state-of-the-art dynamic local search MAX-CLIQUE algorithm is DLS-MC [2].
DLS-MC is based on a combination of constructive and perturbative local search strategies, and makes use of penalty values associated with the vertices of graphs. A more detailed description of DLS-MC is presented in Chapter 4. All the experiments with DLS-MC in [2] were performed on DIMACS MAX-CLIQUE instances². DLS-MC finds cliques at least as large as the largest known for 77 of the 80 DIMACS MAX-CLIQUE instances, with a success rate of 100% and run times that are typically much less than 1 CPU second. The 3 instances for which DLS-MC did not always find the largest known cliques within the allotted maximum number of search steps are C2000.9, MANN a81 and MANN a45. The experiments with DLS-MC were performed on a 2.2 GHz Pentium IV machine with 512KB L2 cache and 512MB RAM. Two newer local search methods based on DLS-MC were proposed to almost eliminate the parameters of DLS-MC: phased local search (PLS) [3] and the more recent algorithms by Grosso et al. [9]. They are very powerful algorithms; however, there are some DIMACS MAX-CLIQUE instances for which these algorithms, unlike DLS-MC, either cannot consistently find the largest known cliques or cannot find any of the largest known cliques at all. PLS, as its name suggests, has phases, each designed to find target cliques with different characteristics: one phase is designed to find target cliques in random graphs, the second to find target cliques with low-degree vertices, and the last to find target cliques with high-degree vertices. PLS has comparable, and sometimes improved, performance relative to DLS-MC on almost all of the 80 DIMACS instances. PLS does not consistently find the largest known cliques of 4 of the 80 instances. The instance whose solution DLS-MC found consistently but PLS did not is keller6. All the experiments with PLS in [3] were performed on a dedicated 2.4 GHz Pentium IV machine with 512KB L2 cache and 512MB RAM.
To execute the DIMACS Machine Benchmark, this machine requires 0.62 CPU seconds for r300.5, 3.80 CPU seconds for r400.5 and 14.50 CPU seconds for r500.5. This machine seems to be faster than the one used for the experiments of DLS-MC. The special feature of the first algorithm of Grosso et al. is the use of extensive plateau steps to find sequences of neighboring cliques. A plateau step replaces a vertex of the current clique with a vertex from the graph such that the size of the clique does not change.

(² available at ftp://dimacs.rutgers.edu/pub/challenge/graph)

There is no penalty assigned to the vertices of the graph; however, the vertices that are removed from the current clique in a plateau step are maintained in a set U, to prohibit the reconstruction of the same clique again. The vertices of the set U can be added back to the clique only if they increase its size. The other property of this algorithm is that the selection of vertices is completely random. The second algorithm of Grosso et al. associates penalty values with the vertices of graphs. The penalty values are updated after each iteration of the algorithm, that is, after either a plateau step or an increasing step. During an increasing step a new vertex is added to the current clique. Grosso et al. performed their experiments in [9] on a machine that seems to have very similar performance to the one used for the experiments of PLS. Unfortunately, no direct comparison was performed between the algorithms by Grosso et al. and the two algorithms DLS-MC and PLS. Grosso et al. compared their algorithms with the reactive local search (RLS) algorithm by Battiti et al. [38] and a variable neighborhood search (VNS) based algorithm by Hansen et al. [39]. The algorithms of Grosso et al. perform better than RLS and VNS with respect to either the size of cliques constructed within almost the same time, or the time required to construct cliques of the same size. The algorithms by Grosso et al.
find the largest known cliques for all the 80 DIMACS MAX-CLIQUE instances except MANN a81, and increased the best known clique size [2] for the C2000.9 instance. Algorithm 1 of Grosso et al. with restart rule 2 did not consistently find any of the largest known cliques for 9 of the 35 DIMACS MAX-CLIQUE instances reported in [9], and their second algorithm with restart rule 2 did not consistently detect any of the largest known cliques for 5 of those 35 instances (different restart rules resolved this issue in many cases). These 35 instances are from the following families of graphs: p hat, san, sanr, C, brock, hamming, MANN and keller [9]. Based on the tables in [3], PLS, when compared with the first algorithm of Grosso et al. using restart rule 2, either found a larger clique within almost the same time or found the largest known cliques faster for 20 of the 35 DIMACS graphs reported in [9]; when compared with the second algorithm of Grosso et al. using restart rule 2, it either found a larger clique within almost the same time or found the largest known cliques faster for 19 of those 35 graphs. The empirical results of an algorithm for the problem of MAX-CLIQUE by Tomita et al. (MCR) [40], based on their earlier work [41] on the generation of all maximal cliques, are also compared with the empirical results of DLS-MC [2] here. It is very difficult to judge whether the two machines on which the experiments of DLS-MC and MCR were performed have similar power, but the machine specifications, to the extent provided, are the same, and the paper on MCR was published one year after the paper on DLS-MC. A comparison based on Table 3 of [40] and Table 1 of [2] shows that DLS-MC is much faster than MCR.
The MAX-CLIQUE problem may be formulated in various forms, such as a continuous optimization problem [42], a SAT problem [43], a neural network [44] or an integer programming problem [45]. A recent enumerative heuristic (CEH) based on continuous optimization by Bulò et al. has a goal close to that of this thesis [42]: to ideally report the k (a user-defined parameter) largest cliques of the graph, enumerated at the end of each run. CEH was overall better than the four related methods it was compared with (inverted neural network [44], annealed imitation heuristic [46], continuous based heuristic [47], QUALEX-MS [48]), but was inferior to the reactive local search method (RLS) [38]. RLS is a tabu search method that automatically adjusts the control parameter for the amount of diversification from the global optimum during the search process. Bulò et al.'s experiments were based on an unoptimized C implementation on a 64-bit PC with a 2 GHz AMD Opteron processor and 1 GB RAM. The experiments were performed on 63 of the 80 DIMACS instances, from the brock, c-fat, hamming, johnson, keller, MANN, p hat, san and sanr families. CEH did not generate any clique with the largest known size for 21 of the graphs (the average cut-off times were 29.98 CPU seconds for the brock instances that it failed on, 528 CPU seconds for MANN a45, 172.1 CPU seconds for the p hat instances that it failed on, and 25.51 CPU seconds for the san instances that it failed on). It may not be possible to directly compare the run times of DLS-MC with those of CEH, since they were measured using different hardware and operating system environments, but DLS-MC seems to be much faster than CEH [42]. QUALEX-MS is another heuristic MAX-CLIQUE algorithm based on continuous optimization [48]. All the experiments reported in [48] were run on a Pentium IV 1.4 GHz computer under Linux RedHat.
This machine seems to be slower than the ones used for evaluating the other algorithms discussed in this section. QUALEX-MS failed to find the largest known results for 23 of the 80 DIMACS instances (the average cut-off times were 653.5 CPU seconds for the keller instances that it failed on, 647.75 CPU seconds for the C instances that it failed on, 36 CPU seconds for DSJC1000.5, 1.5 CPU seconds for the gen instances that it failed on, 45 CPU seconds for hamming10-4, 165 CPU seconds for the MANN instances that it failed on, 48.33 CPU seconds for the p hat instances that it failed on, 2 CPU seconds for san400 0.9 1 and 2 CPU seconds for sanr400 0.7). The run times of QUALEX-MS on the other 57 instances are usually close to 1 CPU second; they use a new eigendecomposition routine, DSYEVR from LAPACK [49], which helped to reduce the run times. As already mentioned, DLS-MC is able to find cliques at least as large as the largest known for 77 of the 80 DIMACS instances, with a success rate of 100% and run times that are typically much less than 1 CPU second. Other recent heuristic MAX-CLIQUE algorithms based on continuous optimization, such as annealed replication [50], annealed imitation [51] and the inverted neural network [44], also failed to report the largest known results for many of the DIMACS instances. Another recent approach by Régin uses constraint programming [36], based on the first algorithm of its kind by Fahle [52], which is a branch-and-bound algorithm. Régin's method failed to find the largest known solutions for 13 of the 66 DIMACS instances used. DLS-MC performs better than this algorithm, as it can consistently find the largest known results for 77 of the 80 DIMACS MAX-CLIQUE instances, with run times typically much less than 1 CPU second.
A vertex-cover based approach by Taillon [53] failed to find any of the largest known cliques, in any of the 50 runs, for 8 of the 41 DIMACS MAX-CLIQUE instances used in [53]. These 8 instances are brock400-1, brock800-1, brock800-2, brock800-3, brock800-4, hamming10-4, MANN a27 and p hat1000-3. This method also failed to consistently find the largest known cliques for 6 of the 41 DIMACS instances; the instances reported are brock400-2, brock400-3, brock400-4, p hat500-3, p hat700-3, p hat1500-1, p hat1500-2 and san1000. DLS-MC performs better than this algorithm for the same reason mentioned above. A recent k-Opt local search algorithm by Katayama et al. [54] was compared with four other state-of-the-art algorithms: an algorithm based on reactive local search [38], DLS-MC, an algorithm based on variable neighborhood search [39], and a steady-state genetic algorithm [55]. k-Opt is a greedy method that adds k vertices to the current clique at the end of each iteration [56]. Table 2 of [54] compares these five algorithms on 37 of the DIMACS MAX-CLIQUE graphs, from the following families: C, DSJC, MANN, brock, gen, hamming, keller and p hat. On these 37 instances, DLS-MC performs much better than the other 4 algorithms, either by finding the largest clique size within almost the same time or by requiring the least time to report any of the largest known cliques, except for 6 of the instances from the keller, MANN and C families. As explained in Section 4.2, those instances are not considered in the experiments of this thesis because of their large number of solutions. We also considered a very recent algorithm based on ant colony optimization for the problem of MAX-CLIQUE. Xu et al. [57] performed their experiments on 12 of the 80 DIMACS MAX-CLIQUE instances.
These 12 instances are from the following families of graphs: MANN, brock, C, hamming, p hat and keller. For 5 of these 12 instances, it either did not reach any of the largest known clique sizes at all or did not detect the largest known clique sizes consistently. These 5 instances are MANN a27, brock400-2, brock800-2, C500.9 and keller6. As is known from the literature, there is no single best heuristic MAX-CLIQUE algorithm, but rather a set of them [2, 58]. We will use DLS-MC and PLS because of their power to find the largest known cliques of the DIMACS MAX-CLIQUE instances very quickly. For a graph with m target cliques, we need efficient algorithms so that we can sample target cliques a large multiple of m times in a reasonable time, in order to obtain a good approximation of the distribution of the sampled target cliques.

2.2 Near-uniform SAT samplers

NP-complete problems are efficiently reducible to each other. A parsimonious translation of an NP-complete problem to another one preserves the number of solutions of the original problem. SAT seems a particularly good target because current SAT solvers can find satisfying assignments of some large and complex SAT instances, involving up to a million variables and several million constraints, and there are already near-uniform sampling approaches designed for SAT [1, 31]. If we had a uniform SAT sampler, then after translating a k-CLIQUE instance to a SAT instance parsimoniously, we could use the uniform SAT sampler to sample near-uniformly its set of satisfying assignments and translate the satisfying assignments back to cliques. There has already been some work on sampling uniformly the set of satisfying assignments of a SAT instance. Wei et al. showed that ideas from random walk based SAT solvers can be used to sample the models of certain propositional theories near-uniformly [31]. The SAT solver used by Wei et al. is WalkSAT, which is a random walk based solver with a greedy bias.
WalkSAT starts with a random truth assignment; as long as there is an unsatisfied clause, the algorithm picks an unsatisfied clause and flips the truth assignment of one of its variables, until the maximum number of flips specified by the user is reached. The clause is picked uniformly at random from the unsatisfied clauses. The variable picked is one whose flip results in the fewest previously satisfied clauses becoming unsatisfied; if there is more than one variable with that outcome, one of them is picked uniformly at random. There is also some probability of picking the variable at random from the set of all variables. WalkSAT may restart with a new random truth assignment if no satisfying assignment is found for a long time [59]. Due to the greedy component of WalkSAT, there is no guarantee that all of the satisfying assignments are reached within a bounded time. Wei et al. also showed that even the satisfying assignments within the same cluster (i.e. a set of satisfying assignments within a given Hamming distance from a specified point) may not be sampled with the same probability. Monte-Carlo methods can sample near-uniformly the satisfying assignments of a cluster if one of the satisfying assignments of the cluster is provided, but in general cannot reach a satisfying assignment in the first place. A Monte-Carlo method for SAT can be summarized as follows. A fixed user-defined parameter called the temperature, T, is given, and the method starts with an initial truth assignment s. While no satisfying assignment has been reached and the maximum number of allowed Monte-Carlo steps has not been exceeded, a new truth assignment s′ is accepted instead of s with probability

exp[(f(s) − f(s′)) / T],

where f returns the number of unsatisfied clauses of the given truth assignment [60]. Wei et al. used a hybrid strategy to sample the satisfying assignments near-uniformly within a cluster: the method interleaves biased random walk steps and Monte-Carlo steps with some probability p.
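The hybrid strategy described above can be sketched as follows. This is a minimal illustration on a toy CNF (clauses as lists of signed integers, DIMACS-style), not Wei et al.'s actual implementation; the parameter names (T, p, noise, max_steps) are our own:

```python
import math, random

# Toy CNF, DIMACS-style: 3 means x3 is true, -3 means x3 is false.
CLAUSES = [[1, 2], [-1, 3], [-2, -3], [1, 3]]
NUM_VARS = 3

def unsat_clauses(assign):
    """Clauses not satisfied by the assignment {var: bool}."""
    return [c for c in CLAUSES
            if not any((lit > 0) == assign[abs(lit)] for lit in c)]

def walksat_step(assign, rng, noise=0.5):
    """Pick a random unsatisfied clause and flip one of its variables:
    a random one with probability `noise`, else the greedily best one."""
    clause = rng.choice(unsat_clauses(assign))
    if rng.random() < noise:
        var = abs(rng.choice(clause))
    else:
        var = min((abs(lit) for lit in clause),
                  key=lambda v: len(unsat_clauses({**assign, v: not assign[v]})))
    assign[var] = not assign[var]

def metropolis_step(assign, rng, T=0.1):
    """Propose flipping a random variable; accept with probability
    exp[(f(s) - f(s')) / T], where f counts unsatisfied clauses."""
    var = rng.randrange(1, NUM_VARS + 1)
    before = len(unsat_clauses(assign))
    assign[var] = not assign[var]
    after = len(unsat_clauses(assign))
    if rng.random() >= min(1.0, math.exp((before - after) / T)):
        assign[var] = not assign[var]  # reject the move: undo the flip

def hybrid_sample(p=0.5, max_steps=5000, rng=None):
    """Interleave WalkSAT steps and Monte-Carlo steps with probability p;
    return a satisfying assignment, or None if the step budget runs out."""
    rng = rng or random.Random(0)
    assign = {v: rng.random() < 0.5 for v in range(1, NUM_VARS + 1)}
    for _ in range(max_steps):
        if not unsat_clauses(assign):
            return assign
        (walksat_step if rng.random() < p else metropolis_step)(assign, rng)
    return assign if not unsat_clauses(assign) else None
```

Repeated calls with fresh random seeds yield repeated samples; the non-uniformity discussed in the text shows up as unequal frequencies over the satisfying assignments.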
The temperature of the Monte-Carlo method was determined manually and set to 0.1 to achieve a degree of uniform sampling. As noted by Wei et al., their strategy works on some instances, but there are cases where their method does not sample the satisfying assignments uniformly. Another attempt to sample the set of satisfying assignments uniformly is XORSample [1]. The main idea behind XORSample is to randomly restrict the search space and sample a satisfying assignment from the smaller search space. In order to restrict a SAT search space, the CNF formulas are augmented by XOR clauses. An XOR clause D over the variables X of a Boolean formula F is the logical exclusive-or of a subset of X ∪ {1}, converted to CNF format. An XOR clause eliminates any given satisfying assignment of the CNF formula with probability 1/2, since the number of XOR variables set to true in the satisfying assignment is either odd or even. Under an assumption of full independence³ among the satisfying assignments, XORSample could result in uniform sampling; however, XOR clauses only guarantee pair-wise independence⁴ and 3-wise independence; Gomes et al. [1] show theoretically that this is sufficient to reach near-uniform sampling. There are two variants of these SAT samplers by Gomes et al. The first variant, XORSample, is described as follows. For a SAT instance with 2^s satisfying assignments (the number of solutions is assumed to be known for simplicity), s XOR clauses are added to the original formula. After the XOR clauses are added to the original SAT formula F, a subset of the satisfying assignments of F survive. If there is a unique surviving satisfying assignment, then XORSample reports that satisfying assignment and terminates. Otherwise, if the constrained formula does not have a unique solution, XORSample discards the XOR clauses Q_s and the constrained formula F_s^q, and iterates the process by constructing a new constrained formula using new XOR clauses (rejection sampling).
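The rejection-sampling loop can be illustrated on a toy scale; the sketch below operates on an explicitly enumerated solution set standing in for the SAT solver used in the real algorithm, and all helper names are ours:

```python
import random
from itertools import product

def random_xor(num_vars, rng):
    """A random XOR constraint: a random subset of the variables plus a
    required parity bit (the bit plays the role of the constant 1 in the
    text). A fixed assignment satisfies it with probability exactly 1/2."""
    subset = [v for v in range(num_vars) if rng.random() < 0.5]
    return subset, rng.randrange(2)

def satisfies(assign, xor):
    subset, parity = xor
    return sum(assign[v] for v in subset) % 2 == parity

def xor_sample(solutions, num_vars, s, rng, max_tries=10000):
    """XORSample-style rejection sampling: add s random XOR constraints;
    report the surviving solution if it is unique, otherwise discard the
    constraints and retry with fresh ones."""
    for _ in range(max_tries):
        xors = [random_xor(num_vars, rng) for _ in range(s)]
        survivors = [a for a in solutions
                     if all(satisfies(a, x) for x in xors)]
        if len(survivors) == 1:
            return survivors[0]
    return None

# Toy solution set: the 8 even-parity assignments over 4 variables,
# i.e. 2^3 solutions, so s = 3 XOR clauses are added per attempt.
solutions = [a for a in product((0, 1), repeat=4) if sum(a) % 2 == 0]
sample = xor_sample(solutions, 4, 3, random.Random(1))
```

In the real algorithm the survivor check is performed by a SAT solver on the augmented formula rather than by filtering an enumerated list, which is exactly what makes the method usable when the solution set is too large to enumerate.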
Another variant, XORSample′, does not require the XOR clauses to reduce the number of satisfying assignments to one, but rather to a number small enough that an exact model counting procedure, e.g., Relsat [23] or Cachet [24], can hopefully count its satisfying assignments. An enumeration algorithm for SAT, e.g., Relsat [23] or mChaff [61], can then be used to enumerate the remaining satisfying assignments, and one of the enumerated satisfying assignments is sampled uniformly at random. For a detailed explanation of this algorithm, see Section 3.2.

2.3 Near-uniform CSP samplers

In this section we present some of the recent sampling procedures available for the problem of constraint satisfaction (CSP). CSP is also a good encoding target, as there are already many sampling procedures available for this problem [27–30]. There are also many encodings available for translating a SAT or k-CLIQUE instance to a CSP instance [32–37].

(³ Full independence: if the XOR clause eliminates a satisfying assignment, this gives us no information about whether any other set of satisfying assignments is eliminated.)
(⁴ Pair-wise independence: if we know the XOR clause eliminates a satisfying assignment σ1, this gives us no information about whether it also removes any other satisfying assignment σ2.)

Larkin [30] proposed a method of sampling each solution of a CSP instance with some well-defined probability, and claimed that the method is the first of its kind. Some recent CSP near-uniform samplers are as follows. Dechter et al. [27] introduced a sampling procedure for uniform sampling in the context of CSP. They generate solutions of a CSP instance uniformly at random by first reducing the CSP to a belief network that expresses a uniform distribution over its set of solutions. They then use a near-uniform sampler over belief networks [62] to sample its solutions.
Gogate and Dechter provided another method, SampleSearch [28], for sampling the solutions of a CSP near-uniformly. In this approach they first express the uniform distribution over the set of solutions as a probability distribution and then sample the set of solutions near-uniformly using a Monte-Carlo sampling procedure. They used an approach, IJGP, empirically shown to approximate the probability distribution of solutions well [63]. Gogate and Dechter also proposed a sampling method [29] based on Sampling/Importance Resampling [64]. They first use SampleSearch [28] to generate an initial set A of samples from the distribution of solutions. Each sample is then weighted by the reciprocal of its probability of being generated from the distribution of solutions. After forming a distribution M over the initial set A by normalizing the weights, a smaller set of samples B is drawn from M (resampling). The contribution of Gogate and Dechter to this resampling procedure is to store the set A as an AND/OR structure. We did not take the approach of using CSP samplers; however, it could be studied as possible future work.

Chapter 3: Near-uniform sampling by translating k-CLIQUE to SAT

As there are already several approaches to uniform sampling of the satisfying assignments of SAT formulas (see Section 2.2), one may translate a k-CLIQUE instance to a SAT instance parsimoniously, then sample the satisfying assignments of the SAT instance using the available SAT samplers, and translate the satisfying assignments back to cliques of size k. There are various ways of directly translating k-CLIQUE to SAT [43, 65, 66]. There are also other ways of translating k-CLIQUE to SAT through a chain of reductions, by first translating k-CLIQUE to CSP and then possibly to SAT [32–37]. Some of these encodings are called compact by Hoos [67] because of the relatively small number of variables they require (and hence the smaller search space they induce).
For example, Iwama et al. [65] introduced quite complicated methods to decrease the number of variables required to translate a k-CLIQUE instance to a SAT instance, as they believed that instances with a smaller number of variables (i.e. a smaller search space) are easier to solve; however, Hoos, using other measurements, distinguished between instances with a small search space and effectively solvable ones, and showed that local search SAT solvers typically have difficulty solving these compact instances [67]. In the remainder of this chapter, we first translate some of the DIMACS MAX-CLIQUE graphs to Boolean formulas using a sparse encoding of k-CLIQUE to SAT [66]. Next, with two of the SAT samplers (XORSample and XORSample′ [1]) described in Section 2.2, we sample the satisfying assignments of the encoded instances and translate them back to cliques. This approach was feasible for only two of the DIMACS MAX-CLIQUE graphs, as either the size of the encoded SAT instances became huge or the SAT solvers were not able to decide reasonably quickly whether those instances were satisfiable.

3.1 An encoding of k-CLIQUE to SAT

In this section, we present the encoding we used to translate the DIMACS MAX-CLIQUE instances to SAT instances. Since different encodings affect the performance of SAT solvers, we need to choose one carefully. Clark et al. [68] showed empirically that hard instances usually have few satisfying assignments. However, the number of solutions is not the only factor affecting the hardness of SAT instances, as the hardness of instances with the same number of solutions varies significantly [68]. Frank et al. showed that as the number and the size of local minima regions increase, it is more likely that SAT solvers take more time to discover a satisfying assignment [69].
Hoos [67] introduced the standard deviation of the objective function and the local minima branching factor (i.e. the number of neighboring assignments with the same value of the objective function) to measure the effectiveness of different encodings. Stochastic local search solvers search instances with a large standard deviation of the objective function, which indicates a rugged search space, more effectively. Instances with highly branched local minima regions are also hard, as there are few escape routes from these minima regions. Based on these two measurements and solution density (i.e. number of solutions / size of the search space), Hoos suggested that a sparse encoding most likely leads to a SAT instance that local search solvers can search more effectively [67]. Hence, we follow a sparse encoding [66] of k-CLIQUE to SAT rather than a compact one [65], as was also done by Kugele for the same purpose [66].

Given a k-CLIQUE instance ⟨G, k⟩, where G = (V, E) with a set of vertices V = {v1, v2, ..., vn}, a set of edges E ⊆ V × V, and target clique size k, we construct a SAT instance in the form of a CNF formula F as described in the following. Let {X_{i,j} : 1 ≤ i ≤ n and 1 ≤ j ≤ k} and {S_i : 1 ≤ i ≤ n}, where n is the number of vertices of G and k is the target clique size, be two sets of Boolean variables. Variable X_{i,j} encodes whether vertex v_i is the j-th node of a size-k clique of the graph G, and S_i is an auxiliary variable that is true if and only if vertex v_i is part of a clique of size k. F is a CNF formula over those variables and consists of a conjunction of CNF clauses C_1 ∧ C_2 ∧ ... ∧ C_w, where each C_i = (l_1 ∨ l_2 ∨ ... ∨ l_z) is a disjunction of literals, and a literal is a Boolean variable x or its negation ¬x. In this encoding there are 4 classes of clauses:

• φ1: (X_{1,j} ∨ X_{2,j} ∨ ... ∨ X_{n,j}) for each j from 1 to k. Since there are k nodes in a clique, at least one variable X_{i,j} must be set to true for each j from 1 to k. There are k clauses of this type.

• φ2: (¬X_{i,s} ∨ ¬X_{i,t}) for each i from 1 to n and each pair s, t of numbers from 1 to k such that s < t. There are (k choose 2)·n clauses of this type. We need this type of clause because a clique cannot contain multiple copies of the same vertex.

No pair of variables X_{i,j} and X_{g,h} should both be true if the vertices v_i and v_g are not adjacent in the graph G. The following two types of clauses allow only adjacent vertices to construct a clique.

• φ3: (X_{i,1} ∨ X_{i,2} ∨ ... ∨ X_{i,k}) ⇒ S_i, represented in CNF as the following k clauses: (¬X_{i,1} ∨ S_i) ∧ (¬X_{i,2} ∨ S_i) ∧ ... ∧ (¬X_{i,k} ∨ S_i). There are k·n CNF clauses of this type.

• φ4: (¬S_i ∨ ¬S_g) for every pair of non-adjacent vertices v_i and v_g. There are (n choose 2) − (# of edges in G) clauses of this type.

The formula F encoding the CLIQUE instance ⟨G, k⟩ as a SAT instance is defined as φ1 ∧ φ2 ∧ φ3 ∧ φ4. This encoding requires O(k²·n) clauses and O(k·n) variables. An example of the above translation on a small graph with 4 vertices is presented in Figure 3.1.

• φ1: There are k = 3 clauses of this type.
– (X_{1,1} ∨ X_{2,1} ∨ X_{3,1} ∨ X_{4,1}), with X_{i,1}, 1 ≤ i ≤ 4, representing the 1st vertex of the clique
– (X_{1,2} ∨ X_{2,2} ∨ X_{3,2} ∨ X_{4,2}), with X_{i,2}, 1 ≤ i ≤ 4, representing the 2nd vertex of the clique
– (X_{1,3} ∨ X_{2,3} ∨ X_{3,3} ∨ X_{4,3}), with X_{i,3}, 1 ≤ i ≤ 4, representing the 3rd vertex of the clique

• φ2: There are (k choose 2)·n = 12 clauses of this type. The following clauses guarantee that a clique does not contain multiple copies of the same vertex.
– Vertex v1 occupies at most one position: (3 clauses)
(¬X_{1,1} ∨ ¬X_{1,2}) ∧ (¬X_{1,1} ∨ ¬X_{1,3}) ∧ (¬X_{1,2} ∨ ¬X_{1,3})

Figure 3.1: A graph with a maximum clique of size 3.
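The clause generation of the four classes can be sketched compactly; the sketch below is our own illustrative helper (clauses as lists of signed integers, with an arbitrary variable numbering for X_{i,j} and S_i), not the thesis's implementation:

```python
from itertools import combinations

def encode_k_clique(n, edges, k):
    """Generate the four clause classes of the sparse k-CLIQUE-to-SAT
    encoding. Vertices are 1..n; X(i, j) says vertex i is the j-th node
    of the clique; S(i) says vertex i is in the clique. Clauses are
    lists of signed integers (negative = negated variable)."""
    X = lambda i, j: (i - 1) * k + j            # variables 1 .. n*k
    S = lambda i: n * k + i                     # variables n*k+1 .. n*k+n
    E = {frozenset(e) for e in edges}
    clauses = []
    # phi1: each clique position j is occupied by some vertex (k clauses).
    for j in range(1, k + 1):
        clauses.append([X(i, j) for i in range(1, n + 1)])
    # phi2: no vertex occupies two positions (C(k,2)*n clauses).
    for i in range(1, n + 1):
        for s, t in combinations(range(1, k + 1), 2):
            clauses.append([-X(i, s), -X(i, t)])
    # phi3: (X_{i,1} v ... v X_{i,k}) => S_i, as k clauses per vertex.
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            clauses.append([-X(i, j), S(i)])
    # phi4: non-adjacent vertices cannot both be in the clique.
    for i, g in combinations(range(1, n + 1), 2):
        if frozenset((i, g)) not in E:
            clauses.append([-S(i), -S(g)])
    return clauses

# The 4-vertex example of Figure 3.1: a triangle {1,2,3} plus the edge {3,4}.
clauses = encode_k_clique(4, [(1, 2), (1, 3), (2, 3), (3, 4)], 3)
# 3 + 12 + 12 + 2 = 29 clauses, matching the per-class counts in the text.
```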
  – Vertex v_2 occupies at most one position of the clique: (3 clauses)
    (¬X_{2,1} ∨ ¬X_{2,2}) ∧ (¬X_{2,1} ∨ ¬X_{2,3}) ∧ (¬X_{2,2} ∨ ¬X_{2,3})
  – Vertex v_3 occupies at most one position of the clique: (3 clauses)
    (¬X_{3,1} ∨ ¬X_{3,2}) ∧ (¬X_{3,1} ∨ ¬X_{3,3}) ∧ (¬X_{3,2} ∨ ¬X_{3,3})
  – Vertex v_4 occupies at most one position of the clique: (3 clauses)
    (¬X_{4,1} ∨ ¬X_{4,2}) ∧ (¬X_{4,1} ∨ ¬X_{4,3}) ∧ (¬X_{4,2} ∨ ¬X_{4,3})

• φ3: There are k·n = 12 clauses of this type.
  – S_1 is true if v_1 is a vertex of the clique: (¬X_{1,1} ∨ S_1) ∧ (¬X_{1,2} ∨ S_1) ∧ (¬X_{1,3} ∨ S_1)
  – S_2 is true if v_2 is a vertex of the clique: (¬X_{2,1} ∨ S_2) ∧ (¬X_{2,2} ∨ S_2) ∧ (¬X_{2,3} ∨ S_2)
  – S_3 is true if v_3 is a vertex of the clique: (¬X_{3,1} ∨ S_3) ∧ (¬X_{3,2} ∨ S_3) ∧ (¬X_{3,3} ∨ S_3)
  – S_4 is true if v_4 is a vertex of the clique: (¬X_{4,1} ∨ S_4) ∧ (¬X_{4,2} ∨ S_4) ∧ (¬X_{4,3} ∨ S_4)

• φ4: There are C(n,2) − (# of edges in G) = 6 − 4 = 2 clauses of this type:
    (¬S_1 ∨ ¬S_4) ∧ (¬S_2 ∨ ¬S_4)

Using this encoding, if the number of target cliques (of size k) of the graph is t, then the total number of satisfying assignments of the SAT instance is k!·t: any permutation of the k clique positions yields a distinct satisfying assignment for the same target clique. Even though this encoding is not parsimonious, this does not affect the relative frequencies of the target cliques, as each clique is represented exactly the same number of times (k!). An advantage of this encoding is the symmetry between its solutions; Prestwich has shown that instances with symmetrical solutions are easier for local search algorithms [70].

3.2 Two effective methods of sampling

As stated in Section 2.4, XORSample and XORSample′ are two efficient near-uniform SAT samplers. The main idea behind these two algorithms, introduced by Gomes et al.
[1], is to add special randomly generated logical constraints to the original formula such that each constraint eliminates any given truth assignment with probability exactly 1/2. Hence, if the original formula has 2^s satisfying assignments, the formula is expected to have exactly 1 satisfying assignment after s random constraints are added. By repeating this process with a new random set of s constraints, a different satisfying assignment may be sampled.

Gomes et al. used XOR clauses to introduce the extra constraints on the variables of the new SAT formula. An XOR clause D over the variables X of F is the logical exclusive-or of a subset of X ∪ {1}, converted to CNF format. An example of an XOR clause is (x_1 ⊕ x_2 ⊕ ... ⊕ x_i). An XOR clause is satisfied if an odd number of its variables are assigned true. The constant 1 (i.e. an elementary expression that is always true) is used in the construction of XOR clauses to express even parity. The useful property of an XOR clause is that it eliminates any given truth assignment (not necessarily a solution) with probability 1/2, and therefore in expectation cuts the set of satisfying assignments (solutions) in half. The solution count would be exactly halved only if the elimination of one satisfying assignment were fully independent of the elimination of every other satisfying assignment; however, no known compact (polynomial-size) logical constraint makes these eliminations fully independent [1]. The choice of XOR constraints guarantees at least pairwise and 3-wise independence, and Gomes et al. proved that pairwise and 3-wise independence is enough for near-uniform sampling [1].
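The halving property described above can be illustrated with a toy brute-force experiment (a sketch, not the authors' implementation; the "formula" here is an arbitrary stand-in, not one of the encoded instances):

```python
import itertools
import random

def random_xor(n, q=0.5):
    """Draw an XOR constraint from X(n, q): each variable is included
    independently with probability q, and the constant 1 (a parity
    flip) with probability 1/2."""
    variables = [i for i in range(n) if random.random() < q]
    parity = random.random() < 0.5
    return variables, parity

def satisfies_xor(assignment, xor):
    """An XOR clause is satisfied iff an odd number of its variables
    (plus the optional constant 1) are true."""
    variables, parity = xor
    return (sum(assignment[i] for i in variables) + parity) % 2 == 1

# Toy "formula": all assignments over 10 variables whose first two
# bits agree -- 512 "solutions" in total.
n = 10
solutions = [a for a in itertools.product([0, 1], repeat=n) if a[0] == a[1]]

random.seed(0)
trials = 2000
frac = 0.0
for _ in range(trials):
    xor = random_xor(n)  # one fresh XOR constraint per trial
    frac += sum(satisfies_xor(a, xor) for a in solutions) / len(solutions)
frac /= trials
print(round(frac, 2))  # → 0.5: a random XOR keeps a solution with prob. 1/2
```

On average a random XOR constraint keeps half of the surviving solutions, which is exactly why s constraints are expected to whittle 2^s solutions down to one.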
For the experiments, we used the implementation of these algorithms provided by Gomes et al.5

3.2.1 XORSample algorithm

XORSample is an iterative procedure that takes a CNF formula F, the probability q of including a variable in an XOR clause, the number s of XOR clauses, a range [m, x] for the average length (i.e. number of variables) of each XOR clause, the number of variables #var of the original CNF formula, the variables #xorvar that are used in the construction of XOR clauses, and the maximum number of iterations mni of XORSample. In standard usage, #var equals #xorvar. More generally, if the formula has 150 variables but one wants to use only the first 100 of them in the construction of XOR clauses, #var is set to 150 and #xorvar is set to 100. For the parameter s, Gomes et al. suggested using log2(ns), where ns is the expected number of satisfying assignments [1]. XORSample outputs a satisfying assignment upon termination only if F is satisfiable.

5 Available at http://www.cs.cornell.edu/~sabhar/

In each iteration, XORSample adds s random XOR constraints Q_s, drawn independently from the distribution X(n, q), to the original SAT formula F to generate a new SAT formula F_s^q. X(n, q) denotes the probability distribution over the set of all XOR clauses whose variables are selected from the SAT variables independently with probability q; the constant 1 is added to each XOR clause independently with probability 1/2. After the XOR clauses are added to the original SAT formula F, a subset of the satisfying assignments of F survives. If there is a unique satisfying assignment remaining, XORSample reports that satisfying assignment and terminates. In order to check whether a SAT formula has a unique satisfying assignment σ, the negation ¬σ of that satisfying assignment is added to the formula as a blocking clause; if the formula becomes unsatisfiable, then σ was its only satisfying assignment.
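The uniqueness test just described (add the blocking clause ¬σ and solve again) can be sketched as follows; a brute-force enumerator stands in for the SAT solver, and the tiny example formula is ours:

```python
import itertools

def solve(clauses, n):
    """Toy stand-in for a SAT solver such as minisat: brute force over
    all assignments. A clause is a list of literals; literal +i / -i
    means variable i is true / false (variables numbered from 1)."""
    for bits in itertools.product([False, True], repeat=n):
        assignment = dict(enumerate(bits, start=1))
        if all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses):
            return assignment
    return None

def has_unique_solution(clauses, n):
    """XORSample's uniqueness check: find one solution sigma, add the
    blocking clause (the negation of sigma), and test that the
    constrained formula has become unsatisfiable."""
    sigma = solve(clauses, n)
    if sigma is None:
        return False, None
    blocking = [-v if value else v for v, value in sigma.items()]
    return solve(clauses + [blocking], n) is None, sigma

# (x1) AND (not-x1 OR x2) forces x1 = x2 = True: a unique solution.
unique, sigma = has_unique_solution([[1], [-1, 2]], n=2)
print(unique)  # → True
```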
Otherwise, if the constrained formula F_s^q does not have a unique solution, XORSample discards the XOR clauses Q_s and the formula F_s^q and iterates the process by constructing a new constrained formula with new XOR clauses (rejection sampling). XORSample uses a SAT solver (denoted SATSolver(F)), such as minisat [71], which takes a CNF formula F and returns its satisfiability status together with the satisfying assignment found, if any. The full description of XORSample is presented in Procedure XORSample(F, q, s, m, x, #var, #xorvar, mni).

3.2.2 XORSample′ algorithm

XORSample′ takes the same parameters as XORSample, minus the maximum number of iterations mni. It starts by adding s randomly chosen XOR clauses, drawn independently from X(n, q), to the original CNF formula to generate a new SAT formula F_s^q. A SAT model counter, e.g. Relsat [23] or Cachet [24], is then used to count the number ns of satisfying assignments of F_s^q. If ns is not zero, XORSample′ succeeds and reports the i-th surviving satisfying assignment from the set of satisfying assignments of F_s^q enumerated by a SAT enumeration algorithm, e.g. Relsat [23] or mChaff [61], where i is chosen uniformly at random from {1, 2, ..., ns}.6 The advantage of XORSample′ is that it does not need to generate instances that have only one solution, but merely instances that have a small number of solutions. The details of this algorithm are described in Procedure XORSample′(F, q, s, m, x, #var, #xorvar). In this procedure, SATModelCount(F) denotes

6 In particular, the satisfying assignments are stored in a file, each on a separate line, and the solution from the i-th line is reported.
Procedure XORSample(F, q, s, m, x, #var, #xorvar, mni)
Input: A CNF formula F; probability q; positive integer s; minimum number m and maximum number x of variables in each XOR clause; number of variables #var of the CNF formula; the variables #xorvar used in the construction of XOR clauses; maximum number of iterations mni
Output: A satisfying assignment of F, or Failure
begin
    (satisfiable, σ) ← SATSolver(F)          /* decide whether F is satisfiable */
    if satisfiable = FALSE then
        return Failure                        /* no solution exists */
    end
    counter ← 0
    iterationSuccessful ← FALSE
    while iterationSuccessful = FALSE and counter ≤ mni do
        Q_s ← {s random constraints independently drawn from X(n, q)}
        F_s^q ← F ∪ Q_s                       /* add s random XOR constraints to F */
        (satisfiable, σ) ← SATSolver(F_s^q)   /* decide whether F_s^q is satisfiable; σ is the satisfying assignment found, if any */
        if satisfiable = TRUE then
            F′ ← F_s^q ∪ {¬σ}                 /* block σ, removing it from the set of satisfying assignments */
            (satisfiable′, σ′) ← SATSolver(F′)
            if satisfiable′ = FALSE then
                iterationSuccessful ← TRUE
                return σ                       /* report σ, the unique satisfying assignment of F_s^q */
            end
        end
        counter ← counter + 1
    end
    return Failure
end
Procedure XORSample′(F, q, s, m, x, #var, #xorvar)
Input: A CNF formula F; probability q; positive integer s; minimum number m and maximum number x of variables in each XOR clause; number of variables #var of the CNF formula; the variables #xorvar used in the construction of XOR clauses
Output: A satisfying assignment of F, or Failure
begin
    Q_s ← {s random constraints independently drawn from X(n, q)}
    F_s^q ← F ∪ Q_s                           /* add s random XOR constraints to F */
    ns ← SATModelCount(F_s^q)                 /* count the satisfying assignments of F_s^q with an exact model counter */
    if ns ≠ 0 then
        i ← a number chosen uniformly at random from {1, 2, ..., ns}
        A ← Enumerate(F_s^q)                  /* enumerate the satisfying assignments of F_s^q with a SAT enumerator */
        return the i-th satisfying assignment from A   /* sampled successfully */
    else
        return Failure
    end
end

a call to any exact model counter that takes a CNF formula and returns its number of satisfying assignments. Enumerate(F) is a call to any enumeration algorithm that takes a CNF formula and returns the set of all its satisfying assignments.

3.3 Empirical results

The experiments of this section are designed to investigate the sampling performance and practicality of the method proposed in this chapter. The samplers used for these experiments are XORSample and XORSample′. These experiments were performed on a dedicated 2.4 GHz Xeon (dual) machine with 512 KB cache and 1 GB RAM, running SuSE Linux 10.1. As already stated, both samplers use the same parameters. They take a SAT formula F, the probability q of including a variable in an XOR clause, the number s of XOR clauses, a range of length (i.e.
number of variables) of each XOR clause [m, x], the number of variables #var of the original CNF formula, and the variables #xorvar used in the construction of XOR clauses. In standard usage, #var equals #xorvar. XORSample has an extra parameter, mni, the maximum number of iterations of the algorithm.

We tried to encode all of the DIMACS MAX-CLIQUE instances used in this thesis, but the size of many of them after encoding was huge (see Table 3.1). The number of solutions of the encoded instances may affect the time required to sample a solution with XORSample: if a SAT instance has a huge number of solutions, XORSample must constrain the instance with enough XOR clauses to reduce the number of solutions to one, while at the same time having many clauses usually increases the difficulty of the instance. Table 3.1 shows that most of the encoded DIMACS MAX-CLIQUE instances used in this thesis have a huge number of solutions. For a discussion of why we used these instances, see Section 4.2 in Chapter 4. The experiments of this section are performed on the two of these instances that have a relatively small size and a small number of solutions, hamming6-4 and johnson8-2-4. We tried to sample the satisfying assignments of all the other DIMACS MAX-CLIQUE graphs shown in Table 3.1 using XORSample, but each iteration of XORSample took at least 30 CPU seconds; even if XORSample needed only 2 iterations to sample a satisfying assignment, its total run time per satisfying assignment would thus be at least 1 CPU minute. In practice, XORSample needs far more iterations per sample, as it must generate a set of XOR clauses that reduces the number of satisfying assignments of the original SAT formula to a unique one. In the implementation provided by Gomes et al.
for XORSample and XORSample′, the value of q is the average length of the XOR clauses divided by the number of variables of the original CNF formula; the average length of the XOR clauses is (m + x)/2. For the parameter s, we use log2(ns), as suggested by Gomes et al., where ns is the expected number of satisfying assignments [1]. The minimum and maximum lengths m and x of the XOR clauses were not discussed in detail in the theory of XORSample and XORSample′ [1]; however, Gomes et al., in related work on sampling [72], demonstrated empirically the need for, and the possibility of, using short XOR clauses in practice.

Instance          Target size   # of edges   # of target cliques   # of satisfying assignments   # of clauses   # of variables
brock200_1        21            14 834       2                     1.02181884 × 10^20            51 287         4 200
brock200_2        11            9 876        14                    558 835 200                   23 235         2 200
brock200_3        14            12 048       24                    2 092 278 988 800             28 866         2 800
brock200_4        16            13 089       30                    6.27683697 × 10^14            34 027         3 200
brock400_3        30            59 681       31                    8.22283865 × 10^33            206 149        12 000
brock400_4        32            59 765       33                    8.68331762 × 10^36            231 267        12 800
c-fat200-1        12            1 534        14                    6 706 022 400                 33 978         2 400
c-fat200-2        23            3 235        26                    6.72152435 × 10^23            71 888         4 600
c-fat200-5        58            8 473        3                     7.05168399 × 10^78            353 685        11 600
c-fat500-1        14            4 459        19                    1 656 387 532 800             172 805        7 000
c-fat500-2        26            9 139        19                    7.66253776 × 10^27            291 137        13 000
c-fat500-5        64            23 191       3                     3.80660797 × 10^89            1 141 623      32 000
c-fat500-10       126           46 627       3                     7.11651973 × 10^211           4 078 749      63 000
gen200_p0.9_44    44            17 910       4                     1.06330863 × 10^55            200 034        8 800
gen200_p0.9_55    55            17 910       4                     5.07856134 × 10^73            310 045        11 000
gen400_p0.9_65    65            71 820       71                    9.00897219 × 10^90            840 044        25 600
gen400_p0.9_75    75            71 820       77                    2.54707179 × 10^109           1 118 054      29 600
hamming6-2        32            1 824        2                     5.26261674 × 10^35            34 016         2 048
hamming6-4 *      4             704          240                   5 760                         1 956          256
hamming8-4        16            20 864       480                   1.00429391 × 10^16            46 608         4 096
hamming10-2       512           518 656      2                     (512)! × 2                    134 485 504    524 288
johnson8-2-4 *    4             210          105                   2 520                         452            112
johnson8-4-4      14            1 855        30                    2 615 348 736 000             7 924          980
p_hat1000-1       10            122 253      276                   1 001 548 800                 432 257        10 000
p_hat1000-2       46            244 799      491                   2.70178748 × 10^60            1 335 747      46 000
p_hat300-1        8             10 933       13                    524 160                       44 725         2 400
p_hat300-2        25            21 928       52                    8.06582922 × 10^26            120 447        7 500
p_hat300-3        36            33 390       10                    3.71993327 × 10^42            211 296        10 800
p_hat500-1        9             31 569       78                    28 304 640                    115 690        4 500
p_hat500-2        36            62 946       14                    5.20790658 × 10^42            394 840        18 000
p_hat500-3        50            93 800       62                    1.88567378 × 10^66            668 500        25 000
p_hat700-1        11            60 999       2                     79 833 600                    229 862        7 700
p_hat700-2        44            121 728      138                   3.66841477 × 10^56            815 966        30 800
san200_0.7_1      29            13 930       30                    2.6525286 × 10^32             92 999         5 800
san200_0.7_2      18            13 930       2                     1.28047474 × 10^16            40 188         3 600
san200_0.9_1      69            17 910       70                    1.19785717 × 10^100           485 059        13 800
san200_0.9_2      59            17 910       61                    8.45967023 × 10^81            356 049        11 800
san200_0.9_3      43            17 910       50                    3.02076315 × 10^54            191 233        8 600
san400_0.9_1      99            71 820       100                   9.33262154 × 10^157           1 988 079      39 600
sanr200_0.7       18            17 863       13                    8.32308582 × 10^16            40 250         3 600
sanr200_0.9       42            17 863       21                    2.95051285 × 10^52            182 679        8 400
sanr400_0.5       13            39 984       4                     24 908 083 200                76 229         5 200
sanr400_0.7       21            55 869       140                   7.1527319 × 10^21             116 352        8 400
DSJC500.5         13            125 248      56                    348 713 164 800               107 639        6 500

Table 3.1: The sizes of the DIMACS MAX-CLIQUE graphs encoded as SAT instances using the encoding of Section 3.1. The two relatively small instances used in the experiments of this section are marked with an asterisk.

The average length of the XOR clauses determines how quickly XORSample samples a satisfying assignment. In order to find an appropriate range for the average length of the XOR clauses, we manually tried various lengths; the possible lengths of XOR clauses range from 1 to the number of variables. The instance johnson8-2-4 encoded as a SAT instance has 140 SAT variables, and an XOR clause may contain any of these 140 variables. We tried various values for m and x: [10, 30], [30, 50], [50, 70], [70, 90], [90, 100], [110, 120] and [120, 140].
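Returning to Table 3.1, the clause and variable counts there follow directly from the formulas of Section 3.1: k + C(k,2)·n + k·n + (C(n,2) − |E|) clauses and k·n + n variables. A small sketch (the helper name is ours) reproduces the figures for the two instances used in this section:

```python
from math import comb

def encoding_size(n, num_edges, k):
    """Clause/variable counts for the Section 3.1 encoding:
    phi1: k clauses, phi2: C(k,2)*n, phi3: k*n,
    phi4: C(n,2) - |E| (one clause per non-adjacent vertex pair)."""
    clauses = k + comb(k, 2) * n + k * n + (comb(n, 2) - num_edges)
    variables = k * n + n  # the X_{i,j} plus the n auxiliary S_i
    return clauses, variables

# johnson8-2-4: n = 28 vertices, 210 edges, target size k = 4
print(encoding_size(28, 210, 4))  # → (452, 140)
# hamming6-4: n = 64 vertices, 704 edges, target size k = 4
print(encoding_size(64, 704, 4))  # → (1956, 320)
```

The variable counts (140 and 320) are the ones quoted in the surrounding text and include the n auxiliary S_i variables; the variable column of Table 3.1 appears to count only the k·n X variables.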
The only significant change from one range to another was the time required for sampling each solution: as the number of variables in each XOR clause increases, the time needed to sample each satisfying assignment increases as well. The result of 10 000 samples of the satisfying assignments of the encoded johnson8-2-4 instance with m = 10 and x = 30 using XORSample is presented in Table 3.2.

The instance hamming6-4 encoded as a SAT instance has 320 SAT variables. We tried the following ranges for the number of variables to be included in each XOR clause: [10, 30], [30, 50], [50, 70] and [70, 90]. Again, there was no significant difference among the ranges [10, 30], [30, 50] and [50, 70] with respect to the various measures of uniformity. We aborted the experiment with the range [70, 90], as it was taking more than 4 CPU hours for 2 000 runs, while the experiments with the other 3 ranges each took about 2 CPU hours. The sampling performance of 10 000 runs of XORSample on the encoded hamming6-4 instance with m = 10 and x = 30 is presented in Table 3.2.

The number of satisfying assignments of an encoded instance is k! (where k is the target size) times the number of target cliques. This means that for large values of k we need a large number of XOR clauses (large s) to reduce the number of satisfying assignments of the SAT formula to a single one, while as the number of clauses increases, the difficulty of the SAT formula typically increases as well. The result of sampling the target cliques of the instances johnson8-2-4 and hamming6-4 is shown in Table 3.2. We expected XORSample to sample the solutions of both instances near-uniformly because of the theoretical guarantees on its sampling performance provided by Gomes et al. The result presented in Table 3.2 is based on 10 000 runs of XORSample. We used minisat [71] as the underlying SAT solver, which was also used by Gomes et al.
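Concretely, the suggested number of XOR constraints for the two instances used here follows from ns = k!·t (the helper name is ours):

```python
from math import factorial, log2

def suggested_s(k, t):
    """Gomes et al.'s suggestion s = log2(ns), where ns = k! * t is the
    number of satisfying assignments of the encoded instance."""
    return log2(factorial(k) * t)

# johnson8-2-4: k = 4, t = 105 target cliques -> ns = 2520
print(round(suggested_s(4, 105), 1))  # → 11.3
# hamming6-4: k = 4, t = 240 target cliques -> ns = 5760
print(round(suggested_s(4, 240), 1))  # → 12.5
```

For the target sizes in Table 3.1 that reach into the dozens, ns = k!·t explodes, and with it the number of XOR constraints needed.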
We consider the distribution of solutions of an instance with EMPN/E(n) > 2.5 as biased. EMPN/E(n) is a measure of uniformity defined in Section 1.2 as the empirical number of samples required to sample each solution at least once, divided by the theoretically expected number of samples needed to sample each solution at least once. The distributions of solutions of instances with normalized entropy less than 0.9, or with INT10 not equal to 1, are also considered biased. INT10 is a measure of uniformity defined in Section 1.2 as the fraction of solutions whose relative frequency falls in the interval [1/(w·n), w/n] with w = (n + √(n² + 400))/20, where n is the number of solutions. All of these threshold choices for the various measures of uniformity are somewhat arbitrary (see Section 4.3).

Instance        Target size   # of target cliques   EMPN/E(n)   normalized entropy   INT10   CPU(s)/solution
johnson8-2-4    4             105                   1.035       0.994                1.000   1.641
hamming6-4      4             240                   1.077       0.832                0.996   2.403

Table 3.2: Sampling two of the encoded DIMACS MAX-CLIQUE graphs using XORSample.

Instance        Target size   # of target cliques   EMPN/E(n)   normalized entropy   INT10   CPU(s)/solution
johnson8-2-4    4             105                   1.041       1.000                1.000   0.745
hamming6-4      4             240                   1.319       0.992                1.000   1.147

Table 3.3: Sampling the satisfying assignments of two of the DIMACS graphs encoded as SAT instances using XORSample′.

XORSample sampled the solutions of johnson8-2-4 near-uniformly with respect to all 3 measures, as expected; however, the solutions of hamming6-4 were sampled non-uniformly by XORSample with respect to normalized entropy and INT10. The time required to sample each target clique of hamming6-4 is also relatively high compared with the results of the other sampling methods (see Table 4.1). We also sampled the satisfying assignments of the instances johnson8-2-4 and hamming6-4 encoded as SAT instances using XORSample′.
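The normalized entropy and INT10 measures can be computed directly from the raw sample counts; a sketch, with the INT10 interval as given above (the example counts are ours):

```python
from math import log, sqrt

def normalized_entropy(counts):
    """Entropy of the empirical distribution divided by log(n), so a
    perfectly uniform sampler scores 1."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * log(p) for p in probs) / log(len(counts))

def int10(counts):
    """Fraction of solutions whose relative frequency lies in
    [1/(w*n), w/n], with w = (n + sqrt(n**2 + 400)) / 20; this
    interval has width exactly 0.1."""
    n = len(counts)
    total = sum(counts)
    w = (n + sqrt(n ** 2 + 400)) / 20
    return sum(1 for c in counts if 1 / (w * n) <= c / total <= w / n) / n

uniform = [100] * 50            # 50 solutions, each sampled 100 times
skewed = [1000] * 5 + [10] * 45
print(round(normalized_entropy(uniform), 3))  # → 1.0
print(round(int10(uniform), 3))               # → 1.0
print(normalized_entropy(skewed) < 0.9)       # → True: flagged as biased
```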
Relsat [23] is the underlying model counter we used for XORSample′; it was also used by Gomes et al. The result presented in Table 3.3 is based on 10 000 runs of XORSample′. XORSample′ samples the target cliques of johnson8-2-4 and hamming6-4 very uniformly with respect to all three measures of uniformity. These results agree with the theoretical proof of the sampling performance of XORSample′ provided by Gomes et al. The advantage of XORSample′ is its quick response in sampling each solution compared with the run times of XORSample.

We also tried to sample the solutions of all the other DIMACS MAX-CLIQUE instances shown in Table 3.1, encoded as SAT instances, using XORSample′, but Relsat [23] took more than 1 minute to count their solutions.

3.4 Conclusion

As we have seen in this chapter, XORSample and XORSample′ are able to sample the target cliques of the instance johnson8-2-4 very uniformly with respect to all 3 measures of uniformity. The solutions of the instance hamming6-4 were sampled uniformly by XORSample with respect to only EMPN/E(n); this instance was sampled uniformly with respect to all measures of uniformity by XORSample′. Both XORSample and XORSample′ had some limitations in practice. The main problem was the huge size of the encoded instances (see Table 3.1). Another limitation was that sampling satisfying assignments of the encoded SAT instances is expensive, which may be resolved by using SAT solvers other than minisat [71] and Relsat [23]. A limitation more specific to XORSample is the difficulty of tuning its parameters m and x, which was also discussed in [72].
This approach of translating a graph to a Boolean formula is not a practical sampling method: many samples are required to measure the uniformity of the distribution of solutions, and each sample requires a relatively large amount of time for the majority of the instances of Table 3.1. The common problem of such methods of translating graphs to other problems is the huge size of the encoded instances; because of their size, it is not even possible to store most of the encoded instances (see Table 3.1). One may try other types of encodings to overcome this difficulty [43, 65].

We believe the number of solutions of an instance may be a key factor in how quickly a solution is sampled by XORSample. If a SAT instance has a huge number of solutions, then XORSample must constrain the instance with enough XOR clauses to reduce the number of solutions to one; at the same time, having many clauses usually increases the difficulty of the instance. Although XORSample′ is more efficient than XORSample, as it does not have to iterate many times to construct an instance with a unique satisfying assignment, it has a different type of limitation: the small variety of available model counters, as most modern SAT solvers do not support satisfying-assignment enumeration at all [73].

As there are already some counting and sampling procedures [28, 74–76] for the constraint satisfaction problem (CSP), a similar possible method of sampling the target cliques of graphs near-uniformly is the parsimonious translation of k-CLIQUE to CSP [36], followed by sampling the solutions of the encoded instance. The encoding we have used in this chapter translates k-CLIQUE to SAT by first translating it implicitly to CSP. Of course, one may also try to encode k-CLIQUE as SAT by first translating it to CSP explicitly.
There are many CSP to SAT encodings available [37, 77, 78].

Chapter 4

Heuristic MAX-CLIQUE algorithms as samplers

In this chapter we examine how uniformly two of the state-of-the-art heuristic MAX-CLIQUE algorithms sample the target cliques of graphs. The input to these MAX-CLIQUE algorithms is a graph consisting of vertices and edges. The heuristic MAX-CLIQUE algorithm introduced by Pullan and Hoos (DLS-MC) is based on a combination of constructive and perturbative local search strategies that uses penalty values associated with vertices to diversify the choice of candidate vertices [2]. As mentioned in Section 2.3, DLS-MC is one of the state-of-the-art algorithms. The instances used for the experiments of this thesis are the DIMACS MAX-CLIQUE graphs; these instances are extensively used for benchmarking purposes in the recent literature on heuristic MAX-CLIQUE algorithms. We chose DLS-MC to sample the target cliques of these graphs because of its performance on the DIMACS set, and also because its parameters (e.g. penalty delay) may be adjusted for better sampling performance (see Chapter 5).

Pullan later introduced a new heuristic MAX-CLIQUE algorithm (PLS) to eliminate the parameters of DLS-MC. PLS interleaves three different procedures, called phases, in order to find cliques effectively in graphs with different structures. Each phase of PLS is intended for the construction of one type of target clique. The first phase selects vertices based on their penalty values, in order to find target cliques with low-degree vertices. The second phase, in which target cliques with high-degree vertices are generated, is a greedy sub-algorithm. The final phase is for the construction of target cliques in random graphs; in this phase vertices are selected randomly [3].

The rest of this chapter is structured as follows. We first describe the DLS-MC algorithm in detail in Section 4.1.
In Section 4.3 we provide empirical results for sampling the target cliques of the DIMACS MAX-CLIQUE graphs using DLS-MC. Next we describe the details of PLS in Section 4.4, and the sampling performance of PLS on the DIMACS MAX-CLIQUE graphs is provided in Section 4.5.

Both DLS-MC and PLS sample the target cliques of most of the DIMACS MAX-CLIQUE graphs used in this thesis near-uniformly. The advantage of finding target cliques with DLS-MC is that it exposes parameters that can be adjusted by the user; when adjusting the parameters is allowed, more insight may be gained into why an instance is not sampled uniformly. The advantage of PLS is that it is parameter-free, which makes it easier to use; however, the corresponding DLS-MC parameter values built into PLS may not be optimal for sampling the target cliques of graphs near-uniformly.

4.1 The DLS-MC algorithm

DLS-MC is a stochastic local search MAX-CLIQUE algorithm that searches for cliques of a given target size k. DLS-MC has two primary ways of constructing a clique. The first is an iterative improvement procedure, which expands the current clique by adding a vertex in each iteration. The second is a plateau search procedure, which swaps vertices of the current clique with vertices of the graph not in the clique.

The DLS-MC solver has seven parameters: the name of a file containing the input graph in DIMACS format; the number of trials (a trial is a full run of DLS-MC that either reports a target clique or exhausts the maximum number of selections); a parameter called algorithm, which determines how to penalize the vertices; a parameter named restart rule, which determines the method of perturbation; the maximum number of selections of vertices; the target clique size; and the pd value. Not all of these parameters need to be adjusted for better search performance. The various options of the algorithm parameter and the restart rule parameter may be used.
We may also try different values for pd. Some of the parameters appear only in the implementation; these are specified at the beginning of the next section.

DLS-MC associates penalty values with the vertices of the given graph to diversify the search. These penalty values are initialized to zero, their minimum possible value. DLS-MC always adds a new vertex to the current clique from the set of vertices with minimum penalty value. DLS-MC also uses a perturbation technique to avoid search stagnation.

DLS-MC starts with a vertex chosen uniformly at random as its initial clique C; the clique C is then expanded during the iterative improvement phase by iteratively choosing one vertex at a time (if any) that is connected to all the previous vertices of C. The iterative improvement procedure, as described, is based on a greedy construction mechanism [2]; this could be one of the reasons that DLS-MC is biased towards some of the target cliques in some graphs.

If at any point there is no candidate vertex to be added to C, DLS-MC leaves the iterative improvement phase and starts a plateau search. If a vertex x with minimum penalty value exists that is adjacent to all the vertices in C but one, say y, a plateau step swaps x with y. DLS-MC stops the plateau search when one of the following two conditions is met: either a vertex becomes available to expand the current clique using iterative improvement, or no vertex is available to perform a plateau step. A vertex is available for a plateau move if it is connected to all but one of the vertices of the current clique C and is not part of the clique with which the plateau phase started; DLS-MC checks the latter by storing the current clique C when it leaves the iterative improvement phase.
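The iterative improvement phase described above can be sketched as follows (a minimal toy illustration, not Pullan and Hoos's implementation: penalties stay at zero, and the plateau search and perturbation are omitted):

```python
import random

# Toy graph as adjacency sets; edges: 1-2, 1-3, 2-3, 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
penalty = {v: 0 for v in graph}  # initialized to zero, as in DLS-MC

def improvement_candidates(clique):
    """Vertices adjacent to every vertex of the current clique."""
    return [v for v in graph if v not in clique
            and all(v in graph[u] for u in clique)]

def expand(clique):
    """Iterative improvement: repeatedly add a candidate of minimum
    penalty (ties broken uniformly at random) until the clique is
    maximal."""
    while True:
        candidates = improvement_candidates(clique)
        if not candidates:
            return clique
        p_min = min(penalty[v] for v in candidates)
        clique.add(random.choice(
            [v for v in candidates if penalty[v] == p_min]))

random.seed(1)
print(sorted(expand({1})))  # → [1, 2, 3], the maximum clique
```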
DLS-MC stops alternating between plateau search and iterative improvement when either a clique of size k is found or neither a plateau step nor an iterative improvement step from the current clique is possible. DLS-MC perturbs the current clique C of size less than k if no iterative improvement or plateau search is possible. At this point DLS-MC first updates the penalty values of the current clique's vertices: the penalty value of each vertex that was part of the current clique C is increased by one. To prevent the penalty values from becoming too large, and to allow DLS-MC to 'forget' penalty values over time, all non-zero penalty values are decreased by one after every pd penalty updates (where pd is a parameter of DLS-MC). An update cycle is each time DLS-MC finds a maximal clique.

The algorithm parameter of DLS-MC determines how the vertices of a maximal clique of size smaller than the target size should be penalized. We use 4 for the value of algorithm, which is the default value used by Pullan and Hoos and the one they found to give the best performance for discovering a clique of target size. With the algorithm parameter set to 4, the penalty values of the vertices of a maximal clique of size smaller than the target size are updated only at a pd update cycle, so pd obviously needs to be specified.

After updating the penalty values, DLS-MC perturbs the current clique C in one of the following two ways, depending on the value of pd (and on the value of the restart rule parameter, discussed in the following paragraph). If pd is greater than one, the current clique is reduced to the last vertex v that was added to C. With this method
of perturbation, DLS-MC focuses on the neighborhood of the clique C but is unlikely to add recently removed vertices back to C, as their penalty values have just been increased; this is equivalent to restarting the search from v. The other method of perturbation is used when the pd value is one (i.e. no penalties). If pd is one, then with the previous method of perturbation DLS-MC would very likely reconstruct the same clique. Instead, the clique C is perturbed by picking a new vertex v uniformly at random from the graph, adding v to C, and removing the vertices of C that are not adjacent to v.

The restart rule parameter of DLS-MC has 6 possible values and determines how DLS-MC should restart after a perturbation. For pd = 1, the value of restart rule is 1, which means a vertex v is chosen uniformly at random from the graph and the current maximal clique C is reduced to a clique containing v and all the vertices of C that are connected to v. For values of pd > 1, the value of restart rule is 3, which means the current clique C is reduced to the last vertex that was added to C.

4.2 Experimental setup

All the experiments for this thesis were performed on a dedicated 2.4 GHz Xeon (dual) machine with 512 KB cache and 1 GB RAM, running SuSE Linux 10.1. In order to find out empirically how uniformly DLS-MC samples the target cliques of a graph, we performed our experiments on many of the commonly used MAX-CLIQUE instances from the Second DIMACS Implementation Challenge (1992-1993)², which have also been extensively used for benchmarking purposes in the recent literature on MAX-CLIQUE algorithms.
The DIMACS MAX-CLIQUE instances were generated from problems in coding theory, fault diagnosis, Keller's conjecture on tilings using hypercubes and the Steiner triple problem, in addition to randomly generated graphs and graphs in which the maximum clique has been 'hidden' by incorporating low-degree vertices. The problem instances range in size from fewer than 28 vertices and 201 edges to more than 500 vertices and 518 656 edges. In order to find a good approximation of the distribution of the sampled target cliques within a graph with m target cliques, the number of samples should be a large multiple of m. Hence, we focused on those DIMACS MAX-CLIQUE instances (see Appendix) that have fewer than 1 000 target cliques and for which the solvers (DLS-MC [2] and PLS [3]) sample a target clique in at most 1 CPU second on average on our reference machine. This resulted in a set of 44 of the 80 DIMACS MAX-CLIQUE instances. The target sizes for these instances are either the largest known or optimal [2]. The maximum number of solutions among the 44 instances used in this thesis is 491 (see Appendix). The number of samples for each instance is chosen to be 50 000, which gives us a chance to sample each solution about 100 times. The maximum number of selections for both DLS-MC and PLS is 100 000 000, the default value used in [2, 3].

² http://dimacs.rutgers.edu/Challenges/

4.3 Sampling performance of DLS-MC

In this section we present the sampling results of DLS-MC on the 44 of 80 DIMACS MAX-CLIQUE instances using the default pd values. The sampling results of DLS-MC with pd values taken from [2] are presented in Table 4.1. These default pd values minimize the run time of discovering target cliques using DLS-MC.
As stated in Section 1.2, we use three measures of uniformity to determine how uniformly DLS-MC samples the target cliques of each graph. These measures are normalized entropy, INT10, and EMP(n)/E(n). Normalized entropy and INT10 are used together to determine whether DLS-MC is sampling the target cliques of graphs uniformly. We consider the distributions of solutions whose sampling probabilities are not within the INT10 interval as biased. We consider a distribution of sampled target cliques near-uniform if its normalized entropy is greater than 0.9 and its INT10 value is 1 (i.e., the distribution of sampled solutions is at most 10% away from the uniform distribution with respect to normalized entropy, and the difference between the least frequent clique and the most frequent clique is at most 0.1). These choices are somewhat arbitrary. As shown in Figure 4.1, the majority of the distributions of target cliques of DIMACS MAX-CLIQUE graphs sampled by DLS-MC have normalized entropy greater than 0.9. Based on normalized entropy and INT10, the target cliques of 10 of the 44 DIMACS MAX-CLIQUE instances (22% of them) were sampled non-uniformly; these are emphasized in bold in Table 4.1. Any number smaller than 0.001 appears as a blank entry in Table 4.1. The instances whose target cliques are sampled non-uniformly based on these two measures are from the following families of graphs: brock, c-fat, gen, and san.

[Figure 4.1: Cumulative probability distribution of relative entropies across different graphs induced by DLS-MC. (x axis: relative entropy; y axis: cumulative probability)]

The EMP(n)/E(n) uniformity measure is the number of runs of an algorithm needed to sample each target clique at least once, divided by the expected number of runs a theoretically uniform sampler requires.
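The first two measures can be computed directly from the per-solution sample counts. The sketch below is our own code; in particular, the INT10 computation is one plausible reading of the measure, namely the fraction of solutions whose empirical probability falls within 10% of the uniform probability 1/m. The authoritative definition is the one in Section 1.2:

```python
import math

def normalized_entropy(counts):
    """Entropy of the empirical distribution over m solutions, divided
    by log m, so that the uniform distribution scores exactly 1."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(counts)) if len(counts) > 1 else 1.0

def int10(counts, rel=0.10):
    """Fraction of solutions whose empirical probability is within
    rel (10%) of the uniform probability 1/m; this is our reading of
    INT10, hedged against the exact definition in Section 1.2."""
    total, m = sum(counts), len(counts)
    uniform = 1 / m
    return sum(1 for c in counts
               if abs(c / total - uniform) <= rel * uniform) / m
```

For example, four solutions sampled 100 times each give normalized entropy 1 and INT10 = 1, while a skewed count vector lowers both values.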
If the sampler is uniform, then the value of EMP(n)/E(n) should be close to 1. The value 2.5 is used as a threshold on EMP(n)/E(n) for distinguishing near-uniform samplers. This choice is somewhat arbitrary. The cumulative probability distribution of the values of EMP(n)/E(n) is presented in Figure 4.2. Each point (x, y) in the figure represents the probability y that the value of EMP(n)/E(n) for one of the instances is less than or equal to x. As shown in this figure, the value of EMP(n)/E(n) for the majority of the distributions of solutions of the 44 DIMACS MAX-CLIQUE graphs (36 of 44) is less than 2.5. By this measure, most of the DIMACS MAX-CLIQUE graphs are sampled near-uniformly by DLS-MC, as their EMP(n)/E(n) values are very close to 1, well below our threshold of 2.5. DLS-MC did not sample uniformly 8 of the 44 DIMACS MAX-CLIQUE graphs (18% of them) based on this measure; these are emphasized in bold in Table 4.1. Each value after ± in Table 4.1 represents the standard deviation of the value of EMP(n)/E(n).

[Figure 4.2: Cumulative probability distribution of EMP(n)/E(n) across different graphs induced by DLS-MC. (x axis: EMP(n)/E(n); y axis: cumulative probability)]

The instances whose target cliques are sampled non-uniformly based on this measure are from the following families of graphs: brock, gen, san, and sanr. As mentioned in Section 1.2, the value of EMP(n) is evaluated as the average number of runs that the sampler requires to sample each solution at least once, over 100 independent experiments. The distributions of the 100 independent calculations of EMP(n)/E(n) for two of the instances are presented in Figures 4.3 and 4.4. Taking all three measures together, DLS-MC sampled near-uniformly the target cliques of 32 of the 44 DIMACS MAX-CLIQUE graphs used in this thesis (72% of them).
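The denominator E(n) is the classical coupon-collector expectation, and EMP(n) can be estimated by simulation following the 100-experiment protocol described above. A sketch with hypothetical solution weights (our own code, not the thesis implementation):

```python
import random

def expected_runs_uniform(n):
    """Coupon-collector expectation: a uniform sampler needs
    n * (1 + 1/2 + ... + 1/n) runs on average to see each of the
    n solutions at least once."""
    return n * sum(1 / i for i in range(1, n + 1))

def empirical_runs(weights, rng):
    """Number of runs a (possibly biased) sampler with the given
    per-solution weights needs to sample every solution at least once."""
    n, seen, runs = len(weights), set(), 0
    while len(seen) < n:
        seen.add(rng.choices(range(n), weights=weights)[0])
        runs += 1
    return runs

rng = random.Random(0)
n = 10
e_n = expected_runs_uniform(n)
# EMP(n) as the average over 100 independent experiments, as in Section 1.2;
# the weight vectors below are hypothetical examples, not instance data.
emp_uniform = sum(empirical_runs([1] * n, rng) for _ in range(100)) / 100
emp_biased = sum(empirical_runs([10] + [1] * (n - 1), rng)
                 for _ in range(100)) / 100
```

For a uniform sampler the ratio emp_uniform / e_n hovers near 1; a sampler biased towards one solution drives the ratio well above 1, which is what the 2.5 threshold is meant to detect.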
DLS-MC with the default pd values used by Pullan and Hoos was not able to sample 12 of the instances near-uniformly (shown in bold in Table 4.1) with respect to some of the measures of uniformity. The instances that were sampled non-uniformly by DLS-MC based on some of the measures of uniformity are from the following five families of DIMACS MAX-CLIQUE instances: brock, c-fat, gen, san and sanr. Most of the brock instances (5 of 6) were sampled non-uniformly based on some measures of uniformity using DLS-MC.

[Figure 4.3: Cumulative probability distribution of 100 values of EMP(n)/E(n) induced by 100 experiments using DLS-MC for the instance brock200-4.]

[Figure 4.4: Cumulative probability distribution of 100 values of EMP(n)/E(n) induced by 100 experiments using DLS-MC for the instance p hat300-3.]

All of the instances from the following families in Table 4.1 were sampled near-uniformly using DLS-MC based on all three measures of uniformity: hamming, johnson, p hat and DSJC. There does not seem to be a meaningful relationship between the number of solutions and the amount of uniformity. For example, the instance brock200-1 with 2 solutions is sampled non-uniformly, but the instance hamming8-4 with 480 solutions is sampled very uniformly. The instance hamming6-2 with 2 solutions is sampled very uniformly, but the instance san200 0.9 3 with 50 solutions is sampled non-uniformly. There also does not seem to be a meaningful relationship between the number of edges and the amount of uniformity.
For example, the instance hamming10-2 with 518 656 edges is sampled very uniformly, while the instance c-fat200-2 with 3 235 edges is sampled non-uniformly. The instance johnson8-2-4 with 210 edges is sampled very uniformly, but the instance gen400 p0.9 75 with 71 820 edges is sampled non-uniformly.

4.4 The PLS algorithm

Phased local search (PLS) [3] is a local search algorithm based on DLS-MC. Unlike DLS-MC, it has no parameter whose optimal value depends on the problem instance. As the target cliques of different graphs have different characteristics, various strategies are required to find them. A random selection may find cliques containing a combination of low, average and high degree vertices; a greedy selection method favoring high degree vertices is efficient for discovering cliques containing high degree vertices; and a greedy strategy based on penalty values may be used to discover cliques with low degree vertices. PLS, as its name suggests, has three phases, each using one of the vertex-selection methods described above. PLS starts with a random vertex of the given graph and enters its three phases in sequence, repeatedly, until either a clique of the target size is constructed or the maximum number of selections is reached. Each phase is a series of steps that add vertices from the graph to the current clique when possible, followed by swapping a vertex from the graph (called a plateau vertex, as it does not change the size of the current clique) with one of the vertices of the current clique, until no further addition or replacement of vertices is possible (an iteration). At this point, a perturbation is performed and a new iteration is started. Two types of perturbations are performed, depending on the phase that
PLS is performing. For the phase in which the selection of vertices is random, and also for the phase in which selection is greedy favoring high degree vertices, the perturbation technique selects a vertex v uniformly at random from the graph, adds it to the current clique and removes all of the vertices of the current clique that are not connected to v. For the phase in which selection is based on penalty values favoring low degree vertices, the perturbation mechanism reduces the current clique to a vertex selected uniformly at random from the graph. In the next section we present the result of sampling the 44 DIMACS MAX-CLIQUE graphs used in this thesis (listed in Appendix A) using PLS.

Table 4.1: Sampling performance of DLS-MC with pd values taken from [2]. The instances that are sampled non-uniformly with respect to some measures of uniformity are shown in bold. The values for which the instances are considered non-uniform are also shown in bold. Any number < 0.001 appears as a blank entry. Each value after ± represents the standard deviation of EMP(n)/E(n). The data are listed column by column, in row order 1-44:
Instance: brock200 1, brock200 2, brock200 3, brock200 4, brock400 3, brock400 4, c-fat200-1, c-fat200-2, c-fat200-5, c-fat500-1, c-fat500-2, c-fat500-5, c-fat500-10, gen200 p0.9 44, gen200 p0.9 55, gen400 p0.9 65, gen400 p0.9 75, hamming6-2, hamming6-4, hamming8-4, hamming10-2, johnson8-2-4, johnson8-4-4, p hat1000-1, p hat1000-2, p hat300-1, p hat300-2, p hat300-3, p hat500-1, p hat500-2, p hat500-3, p hat700-1, p hat700-2, san200 0.7 1, san200 0.7 2, san200 0.9 1, san200 0.9 2, san200 0.9 3, san400 0.9 1, sanr200 0.7, sanr200 0.9, sanr400 0.5, sanr400 0.7, DSJC500.5
Target size: 21, 11, 14, 16, 30, 32, 12, 23, 58, 14, 26, 64, 126, 44, 55, 64, 74, 32, 4, 16, 512, 4, 14, 10, 46, 8, 25, 36, 9, 36, 50, 11, 44, 29, 18, 69, 59, 43, 99, 18, 42, 13, 21, 13
pd: 2, 2, 2, 2, 15, 15, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
# of solutions: 2, 14, 24, 30, 31, 33, 14, 26, 3, 19, 19, 3, 3, 4, 4, 71, 77, 2, 240, 480, 2, 105, 30, 276, 491, 13, 52, 10, 78, 14, 62, 2, 138, 30, 2, 70, 61, 50, 100, 13, 21, 4, 140, 56
EMP(n)/E(n): 1.153±0.893, 6.856±2.374, 6.528±4.710, 11.934±3.636, 1.028±0.335, 0.988±0.312, 0.904±0.106, 2.374±0.536, 0.842±0.162, 1.064±0.423, 0.896±0.240, 1.005±0.695, 1.073±0.420, 0.835±0.344, 0.895±0.334, 8.961±4.138, 7.000±3.734, 0.707±0.157, 0.959±0.131, 0.978±0.144, 1.143±0.560, 0.876±0.067, 0.923±0.052, 2.183±0.809, 2.067±0.510, 1.116±0.464, 1.473±0.554, 0.985±0.714, 1.821±0.623, 1.232±0.449, 1.606±0.506, 1.063±0.423, 1.418±0.384, 1.176±0.448, 0.903±0.383, 2.065±0.955, 1.994±0.709, 6.011±3.235, 1.373±0.363, 1.096±0.446, 2.840±1.355, 0.883±0.433, 4.002±1.692, 1.683±0.517
normalized entropy: 0.725, 0.485, 0.753, 0.784, 0.687, 0.999, 1.000, 0.671, 0.999, 1.000, 1.000, 1.000, 0.999, 0.981, 0.986, 0.775, 0.847, 1.000, 1.000, 0.999, 1.000, 1.000, 1.000, 0.981, 0.963, 0.974, 0.962, 0.963, 0.972, 0.982, 0.972, 0.999, 0.967, 0.992, 0.997, 0.925, 0.875, 0.797, 0.982, 0.963, 0.895, 0.987, 0.955, 0.978
INT10 (42 values; two entries < 0.001 are blank): 0.417, 0.433, 0.645, 1.000, 1.000, 0.923, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 0.915, 0.974, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 0.984, 0.960, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000
CPU(s)/solution (35 values; nine entries < 0.001 are blank): 0.027, 0.005, 0.004, 0.003, 0.269, 0.341, 0.001, 0.001, 0.001, 0.005, 0.003, 0.002, 0.001, 0.004, 0.001, 0.003, 0.007, 0.004, 0.001, 0.001, 0.002, 0.001, 0.003, 0.032, 0.002, 0.009, 0.062, 0.001, 0.002, 0.005, 0.003, 0.022, 0.064, 0.037, 0.024

4.5 Sampling performance of PLS

In this section we present the sampling performance of PLS on 44 of the 80 DIMACS MAX-CLIQUE instances (see Table 4.2). These 44 instances are the same as those used for DLS-MC. The experiment was performed on the same platform as in Section 4.2.
PLS has four parameters: the name of a file containing the input graph in DIMACS format, the number of trials, the maximum number of vertex selections, and the target clique size. The number of times PLS samples a target clique (i.e., the number of trials) is again 50 000, as for DLS-MC. For the maximum number of selections, we use the default value used by Pullan, 100 000 000 [3]. We use the three uniformity measures from Section 1.2 to assess the sampling performance of PLS. The thresholds of these measures for distinguishing between near-uniform and biased sampling are the same as those used in Section 4.3. PLS sampled 11 of the 44 DIMACS MAX-CLIQUE instances (25% of them) non-uniformly based on the normalized entropy and INT10 measures. The instances whose target cliques are sampled non-uniformly based on these two measures are from the following families of graphs: brock, c-fat, gen, san, and sanr. Using the EMP(n)/E(n) measure with the threshold of 2.5, most of the 44 DIMACS MAX-CLIQUE graphs are sampled near-uniformly by PLS; however, 11 of the 44 instances (25% of them) in Table 4.2 are sampled non-uniformly. Each value after ± in Table 4.2 represents the standard deviation of the value of EMP(n)/E(n). The instances whose target cliques are sampled non-uniformly based on this measure are again from the families brock, c-fat, gen, san, and sanr. PLS was able to sample near-uniformly two of the instances that were sampled non-uniformly by DLS-MC with the default pd values taken from [2]: brock400 3 and sanr200 0.9. On the other hand, PLS sampled san200 0.9 1 non-uniformly, while it was sampled near-uniformly by DLS-MC. For the other instances, DLS-MC sampled them near-uniformly if and only if PLS did so.
PLS sampled the instances brock200 1, brock200 2, brock200 3, brock200 4 and san200 0.9 3 with more bias than DLS-MC with respect to all three measures of uniformity. The instance c-fat200-2 was sampled by PLS with almost the same bias as by DLS-MC, but its value of EMP(n)/E(n) exceeded the threshold of 2.5. PLS sampled the instances gen400 p0.9 65, gen400 p0.9 75 and sanr400 0.7 with almost the same bias as DLS-MC. PLS sampled the instance san200 0.9 2 with more bias with respect to all of the measures, and its values of EMP(n)/E(n) and normalized entropy crossed their thresholds. PLS sampled the instance sanr400 0.5 non-uniformly with respect to normalized entropy and INT10, while DLS-MC sampled it near-uniformly. The instances that were sampled non-uniformly by PLS based on some of the measures of uniformity are from the following five families of DIMACS MAX-CLIQUE instances: brock, c-fat, gen, san and sanr. Most of the brock instances (4 of 6) were sampled non-uniformly by PLS based on some measures of uniformity. All of the instances from the following families in Table 4.2 were sampled near-uniformly by PLS based on all three measures of uniformity: hamming, johnson, p hat and DSJC. There does not seem to be a meaningful relationship between the number of solutions and the amount of uniformity. For example, the instance brock200 1 with 2 solutions is sampled non-uniformly, but the instance hamming8-4 with 480 solutions is sampled very uniformly. The instance hamming6-2 with 2 solutions is sampled very uniformly, but the instance san200 0.9 3 with 50 solutions is sampled non-uniformly. There also does not seem to be a meaningful relationship between the number of edges and the amount of uniformity. For example, the instance hamming10-2 with 518 656 edges is sampled very uniformly, while the instance c-fat200-2 with 3 235 edges is sampled non-uniformly.
The instance johnson8-2-4 with 210 edges is sampled very uniformly, but the instance gen400 p0.9 75 with 71 820 edges is sampled non-uniformly.

4.6 Conclusion

In this chapter we used two state-of-the-art heuristic MAX-CLIQUE algorithms to sample the target cliques of some of the DIMACS MAX-CLIQUE instances. Many of these instances were sampled near-uniformly by both of these samplers. There was one instance, san200 0.9 1, that was sampled near-uniformly by DLS-MC but not by PLS. On the other hand, there were two instances that were sampled near-uniformly by PLS but not by DLS-MC. For the other instances, DLS-MC sampled them near-uniformly if and only if PLS did so. As mentioned in Section 4.5, the instances brock200 1, brock200 2, brock200 3, brock200 4 and san200 0.9 3 were sampled by PLS with more bias than by DLS-MC with respect to all three measures of uniformity. PLS sampled the instance c-fat200-2 with almost the same bias as DLS-MC, but its value of EMP(n)/E(n) exceeded the threshold of 2.5. The instances gen400 p0.9 65, gen400 p0.9 75 and sanr400 0.7 were sampled by PLS with almost the same bias as by DLS-MC. The instance san200 0.9 2 was sampled with more bias by PLS with respect to all of the measures, and its values of EMP(n)/E(n) and normalized entropy crossed their thresholds. The instance sanr400 0.5 was sampled non-uniformly by PLS with respect to normalized entropy and INT10, while DLS-MC sampled it near-uniformly. An advantage of PLS is that it requires no instance-dependent parameter tuning. On the other hand, an advantage of DLS-MC is that by changing the values of its parameters, more insight can be gained into why these algorithms sample the way they do, suggesting new ways to improve their sampling performance.
In Chapter 5 we will sample near-uniformly the 12 instances that were sampled non-uniformly in Table 4.1, by adjusting the value of pd and also by using a method that performs Monte-Carlo steps.

Table 4.2: Sampling performance of PLS. The instances that are sampled non-uniformly with respect to some measures of uniformity are shown in bold. The values for which the instances are considered non-uniform are also shown in bold. Any number < 0.001 appears as a blank entry. Each value after ± represents the standard deviation of EMP(n)/E(n). The data are listed column by column, in row order 1-44:
Instance: brock200 1, brock200 2, brock200 3, brock200 4, brock400 3, brock400 4, c-fat200-1, c-fat200-2, c-fat200-5, c-fat500-10, c-fat500-1, c-fat500-2, c-fat500-5, gen200 p0.9 44, gen200 p0.9 55, gen400 p0.9 65, gen400 p0.9 75, hamming10-2, hamming6-2, hamming6-4, hamming8-4, johnson8-2-4, johnson8-4-4, p hat1000-1, p hat1000-2, p hat300-1, p hat300-2, p hat300-3, p hat500-1, p hat500-2, p hat500-3, p hat700-1, p hat700-2, san200 0.7 1, san200 0.7 2, san200 0.9 1, san200 0.9 2, san200 0.9 3, san400 0.9 1, sanr200 0.7, sanr200 0.9, sanr400 0.5, sanr400 0.7, DSJC500.5
Target size: 21, 11, 14, 16, 30, 32, 12, 23, 58, 126, 14, 26, 64, 44, 55, 64, 74, 512, 32, 4, 16, 4, 14, 10, 46, 8, 25, 36, 9, 36, 50, 11, 44, 29, 18, 69, 59, 43, 99, 18, 42, 13, 21, 13
# of solutions: 2, 14, 24, 30, 31, 33, 14, 26, 3, 3, 19, 19, 3, 4, 4, 71, 77, 2, 2, 240, 480, 105, 30, 276, 491, 13, 52, 10, 78, 14, 62, 2, 138, 30, 2, 70, 61, 50, 100, 13, 21, 4, 140, 56
EMP(n)/E(n): 5.047±5.310, 9.989±3.401, 14.312±5.616, 28.708±12.759, 1.139±0.317, 1.091±0.348, 1.099±0.221, 2.547±0.714, 1.120±0.295, 0.878±0.224, 0.962±0.278, 0.837±0.277, 0.795±0.382, 1.058±0.313, 1.045±0.282, 8.285±3.845, 6.994±5.256, 0.977±0.430, 1.103±0.593, 0.953±0.187, 0.968±0.192, 0.852±0.127, 0.942±0.942, 2.017±0.651, 1.904±0.464, 1.123±0.468, 1.434±0.465, 1.043±0.397, 1.961±0.789, 1.041±0.340, 1.970±0.673, 0.873±0.873, 1.409±0.429, 1.147±0.496, 0.923±0.387, 9.918±3.434, 5.204±3.099, 7.841±3.946, 1.569±0.542, 1.080±0.451, 1.315±0.423, 1.694±1.897, 3.995±1.483, 1.854±0.704
normalized entropy: 0.241, 0.427, 0.715, 0.740, 0.978, 0.989, 1.000, 0.674, 0.999, 1.000, 1.000, 1.000, 1.000, 0.986, 0.986, 0.785, 0.849, 1.000, 1.000, 1.000, 0.999, 1.000, 1.000, 0.983, 0.964, 0.977, 0.963, 0.971, 0.973, 0.980, 0.949, 0.996, 0.971, 0.984, 0.998, 0.854, 0.841, 0.784, 0.976, 0.981, 0.964, 0.839, 0.946, 0.953
INT10 (42 values; two entries < 0.001 are blank): 0.375, 0.433, 1.000, 1.000, 1.000, 0.923, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 0.887, 0.974, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 0.900, 0.984, 0.920, 1.000, 1.000, 1.000, 0.500, 1.000, 1.000
CPU(s)/solution (35 values; nine entries < 0.001 are blank): 0.008, 0.006, 0.003, 0.002, 0.367, 0.225, 0.001, 0.003, 0.001, 0.001, 0.004, 0.003, 0.004, 0.001, 0.005, 0.008, 0.008, 0.001, 0.002, 0.002, 0.001, 0.009, 0.027, 0.002, 0.004, 0.034, 0.001, 0.001, 0.002, 0.005, 0.002, 0.013, 0.018, 0.016, 0.017

Chapter 5

Improvements to DLS-MC

Graphs with different structures may require different methods of sampling to sample their target cliques near-uniformly. As shown in Section 4.3 of Chapter 4, DLS-MC with the original pd values taken from [2] was not able to solve S-CLIQUE for some of the DIMACS MAX-CLIQUE instances. In this chapter we introduce some ways of sampling these instances more uniformly. These methods are based on the adjustment of a parameter of DLS-MC and on some modifications to the way DLS-MC selects a vertex. DLS-MC has various parameters, some of which may be adjusted for better sampling performance. Its pd parameter was adjusted in [2] for finding large cliques quickly. We show in Section 5.2 that the target cliques of some of the instances are sampled more uniformly when pd values other than the ones tuned for finding a target clique efficiently are used. A method inspired by SampleSat [31] is used in Section 5.3 to sample near-uniformly some of the DIMACS MAX-CLIQUE instances. This method integrates a Monte-Carlo procedure with DLS-MC.
Using this method, cliques with some amount of overlap may be sampled more uniformly, and the uniformity of some of the instances improves. We also describe two methods that failed to improve the sampling performance of DLS-MC, and consider some graphs whose structure does not seem to provide any useful information for improving the uniformity of the sampled solutions.

5.1 Sampling uniformly and graph structures

As we will see, different values of the pd parameter affect the amount of uniformity of the distribution of sampled solutions induced by DLS-MC. In this section, we provide an explanation of the effect of pd based on graph structure. A property of DLS-MC is that in order to construct a clique of size k, it adds a vertex (if possible) to an already constructed clique of size k − 1. We propose the following hypothesis based on this fact about DLS-MC.

[Figure 5.1: Depiction of a graph with two maximum (target) cliques c1 and c2. Circles c1 and c2 represent the target cliques of the graph. Each other circle represents a clique of size less than or equal to that of c1 and c2 that shares some vertices with them.]

Hypothesis 1: The sum of the sampling probabilities of the group of cliques of sizes smaller than k that are connected to a target clique of size k has a direct impact on the sampling probability of this target clique when sampled using DLS-MC. The amount of overlap between a clique of size k and cliques of sizes less than k also determines how biased DLS-MC is towards the clique of size k.

In order to simplify our analysis, we have empirically examined the relationship between cliques of target size k and cliques of size k − 1, rather than all of the cliques smaller than the target size. For instance, consider the graph depicted in Figure 5.1. In this instance, cliques c1 and c2 are the target cliques, of size k, and the rest of the cliques have sizes smaller than k.
When the individual sampling probabilities of the cliques around c1 are almost the same as those of the cliques around c2, DLS-MC samples the clique c1 more often than c2 if there is a larger number of smaller cliques around c1, since the sum of sampling probabilities around c1 is then larger. In the next subsection, we test our hypothesis on two of the DIMACS MAX-CLIQUE instances.

5.1.1 Testing Hypothesis 1 on DIMACS graphs

In this subsection we use two of the DIMACS MAX-CLIQUE instances to test Hypothesis 1 from Section 5.1. The graphs chosen for this experiment are from the brock class of DIMACS MAX-CLIQUE instances, namely brock200-1 and brock200-2. We chose these two instances because they were 2 of the 12 instances that were sampled non-uniformly by DLS-MC with the default pd values taken from [2], as shown in Section 4.3. We will later see an analysis of two more of the brock instances. Instance brock200-1 has 200 vertices and 14 834 edges; the size of the target cliques of brock200-1 is 21, and it has 2 distinct cliques (b1 and b2) of this size. The amount of overlap between a clique of size 21 and a clique of size 20 is calculated as the number of vertices of the clique of size 20 that are shared with the clique of size 21, divided by 20. A clique c1 of size 20 is b1-favoring (i.e., c1 favors sampling of b1) if it has more overlap with clique b1 than with the other clique of size 21, b2. In order to determine whether a clique c1 of size 20 has more overlap with b1, the difference between its amount of overlap with b1 and its amount of overlap with b2 is calculated (this difference is called the b1-difference); if the b1-difference is positive, c1 has more overlap with b1 and is therefore b1-favoring. Figure 5.2 illustrates this in the form of bar graphs. Each bar represents the number of cliques of size 20 of the instance brock200-1 whose corresponding b1-differences fall within the range of numbers specified on the x axis.
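The overlap and b1-difference computations just described amount to a few set operations. A small sketch on hypothetical vertex sets (our own toy data, not the actual brock200-1 cliques):

```python
def overlap(small, target):
    """Overlap of a clique with a target clique: the number of shared
    vertices divided by the size of the smaller clique."""
    return len(small & target) / len(small)

def b1_differences(small_cliques, b1, b2):
    """b1-difference of each clique of size k-1: its overlap with b1
    minus its overlap with b2. Positive values are b1-favoring."""
    return [overlap(c, b1) - overlap(c, b2) for c in small_cliques]

# Hypothetical example: two target cliques and two smaller cliques,
# one favoring each target.
b1, b2 = {1, 2, 3}, {4, 5, 6}
diffs = b1_differences([{1, 2, 7}, {4, 5, 7}], b1, b2)
```

In this toy example the first smaller clique is b1-favoring (positive difference) and the second is b2-favoring (negative difference), which is exactly the quantity binned in the bar graphs of Figure 5.2.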
If the number of cliques of size 20 that are b1-favoring (positive b1-difference) is greater than the number of cliques of size 20 that are b2-favoring (negative b1-difference), then DLS-MC samples b1 more often. For each clique of size 20, we calculated its value of b1-difference. As we can see from Figure 5.2, many of the differences are negative, which shows that the clique b1 has less overlap with the cliques of size 20; in other words, most of the cliques of size 20 are b2-favoring. The first clique b1 of size 21 is connected to 184 of the 344 cliques of size 20. The other clique b2 of size 21 is highly connected, to 315 of the cliques of size 20. The sum of the sampling probabilities of the cliques of size 20 around b2 is larger than that around b1, as the cliques of size 20 are sampled with almost equal probability (normalized entropy of 0.952) and there are more cliques of size 20 around b2. DLS-MC therefore samples the target clique b2 much more often (it is sampled with a relative frequency of 0.798).

[Figure 5.2: Each bar represents the number of cliques of size 20 of the instance brock200-1 whose corresponding b1-differences fall within the range of numbers specified on the x axis. For details, see text.]

The other instance that we used to test Hypothesis 1 is brock200-2. Instance brock200-2 has 200 vertices and 9 876 edges. Since brock200-2 has only 1 clique of its largest known size, 12, the chosen target size for cliques of brock200-2 in this experiment is 11. Instance brock200-2 has 14 cliques of size 11. The cliques of size 11 form two groups: one group contains 12 cliques and the other 2 cliques. In each group, the cliques share with each other
90% of their vertices, and between the two groups, the cliques share with each other only 9% of their vertices, except for one of the cliques in the group of 12 that is not connected to either of the two cliques in the other group. As shown in Section 4.3, DLS-MC samples the cliques in the group containing two 11-vertex cliques 87% of the time. Based on Figure 5.3, we provide one possible explanation of why DLS-MC is biased towards these two cliques of brock200-2. Since the cliques within each group share many of their vertices with each other (90%), we randomly picked one clique from each group as a representative. The amount of overlap between a clique of size 10 and a clique of size 11 is calculated as the number of vertices of the clique of size 10 that are shared with the clique of size 11, divided by 10. For each clique of size 10, we determined which representative clique of size 11 has more overlap with it. We did so by first calculating the amounts of overlap between the cliques of size 10 and the clique r1 belonging to the group with two cliques. We then calculated the amount of overlap between each clique of size 10 and the clique r2 from the group containing 12 cliques. By subtracting the latter from the former, we obtain the r1-differences of the cliques of size 10. Each bar of Figure 5.3 represents the number of cliques of size 10 of the instance brock200-2 whose corresponding r1-differences fall within the range of numbers specified on the x axis.

[Figure 5.3: Each bar represents the number of cliques of size 10 of the instance brock200-2 whose corresponding r1-differences fall within the range of numbers specified on the x axis. For details, see text.]
If the number of cliques of size 10 that are r1-favoring (positive r1-difference) is greater than the number of cliques of size 10 that are r2-favoring (negative r1-difference), then DLS-MC samples r1 more often. Figure 5.3 shows that most of the bars with positive values are higher than the bars with negative values; the bars with positive values give the number of cliques of size 10 that have more overlap with r1 (i.e., they are r1-favoring). The relative frequencies of the cliques of size 10 are almost equal (normalized entropy of 0.918). Since most of the cliques of size 10 are r1-favoring, the two cliques from the group with representative clique r1 are sampled more often (the sum of the relative frequencies of these two cliques is 0.87). Among the cliques of brock200-2, there are some cliques of size 10 that do not overlap with any of the cliques of size 11. These cliques do not seem to affect the amount of uniformity among the cliques of size 11, but rather increase the time DLS-MC requires to sample a clique of size 11.

[Figure 5.4: A graph from the family of graphs SG with 4 hairs connected to one of its cliques of size 3 (labels: Bald, Hairy).]

In Subsection 5.1.3, we perform an experiment under controlled conditions on a graph introduced in Subsection 5.1.2, to show an example of cliques of sizes smaller than the target size that do not bias the solver towards any of the target cliques.

5.1.2 Testing Hypothesis 1 using a family of small instances

In order to perform a more controlled experiment, we created a family of small graphs, SG, to test Hypothesis 1. The backbone of an SG graph consists of two cliques of size 3 connected to each other by a bridge of ten edges; one of the cliques of size 3 (called Hairy) has extra edges (called hairs) attached to its vertices, while the other clique of size 3 (called Bald) has no hairs attached to it (see Figure 5.4).
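An SG instance can be generated programmatically. The sketch below is our own reconstruction from the description: the vertex labels, the round-robin attachment of hairs to the Hairy triangle, and the assumption that the bridge endpoints are vertices of the two triangles are ours, since Figure 5.4 is not reproduced here:

```python
def make_sg(num_hairs):
    """Sketch of an SG instance as an adjacency-set dictionary: two
    triangles ('Bald' on vertices 0-2, 'Hairy' on vertices 12-14)
    joined by a bridge of ten edges, with num_hairs degree-one 'hair'
    vertices attached to the Hairy triangle."""
    adj = {}

    def add_edge(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Bald triangle on vertices 0, 1, 2
    add_edge(0, 1); add_edge(1, 2); add_edge(0, 2)
    # bridge of ten edges: path 2-3-4-...-12
    for i in range(2, 12):
        add_edge(i, i + 1)
    # Hairy triangle on vertices 12, 13, 14 (12 is the bridge endpoint)
    add_edge(12, 13); add_edge(13, 14); add_edge(12, 14)
    # hairs: fresh degree-one vertices attached round-robin to Hairy
    next_v = 15
    for h in range(num_hairs):
        add_edge((12, 13, 14)[h % 3], next_v)
        next_v += 1
    return adj
```

With this construction, the SG instance with 60 hairs has 75 vertices and 76 edges, and its only triangles are Bald and Hairy, the two target cliques of size 3.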
Figure 5.5 shows that as we increase the number of hairs (i.e., cliques of size 2) connected to Hairy, the normalized entropy of the solutions decreases; that is, DLS-MC becomes biased towards Hairy. We translated the SG instance with the least entropy (i.e., the one with 60 hairs connected to Hairy) to a SAT instance (with 304 variables and 2 774 clauses) using the encoding described in Section 3.1. This experiment was performed on the same platform as that in Section 4.2. The result of Table 5.1 is based on 1 000 runs of the XORSample procedure. The average length of the XOR clauses was 40. XORSample sampled Hairy and Bald very uniformly, as shown in Table 5.1. One reason that this amount of uniformity is reached is that the structure of this instance is lost after the translation to SAT.

Figure 5.5: The effect of increasing the number of hairs connected to the vertices of Hairy on the amount of uniformity of the sampled target cliques. As the number of hairs increases, the normalized entropy decreases.

Instance           Target size  # of target cliques  EMPN/E(n)  normalized entropy  INT10  CPU(s)/solution
SG with 60 hairs   3            2                    1.040      0.999               1.000  1.501

Table 5.1: The result of sampling the solutions of SG with 60 hairs attached to Hairy using XORSample. The length of the XOR clauses is 40.

5.1.3 Not all cliques of size k − 1 have an effect on the amount of uniformity

In this subsection, we show an example of cliques of sizes smaller than the target cliques that do not bias the solver towards any of the target cliques. The cliques of size k − 1 in this example do not share any of their vertices with the target cliques, and are located at equal distance from both of the target cliques. For this experiment, we used the graph SG with 60 hairs from the previous subsection.
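The normalized entropy used as a uniformity measure throughout this chapter can be computed as sketched below; we assume here that it is the Shannon entropy of the empirical distribution over sampled target cliques divided by the entropy of the uniform distribution, so that 1.0 means perfectly uniform sampling:

```python
import math

# Sketch of the normalized-entropy uniformity measure: Shannon entropy of
# the empirical sampling distribution divided by log2 of the number of
# target cliques (the entropy of the uniform distribution). This is our
# reading of the measure, not the thesis implementation.

def normalized_entropy(counts):
    total = sum(counts)
    n = len(counts)
    if n < 2 or total == 0:
        return 1.0  # degenerate cases: a single clique, or no samples
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h / math.log2(n)
```

For example, sampling two target cliques with relative frequencies 0.85 and 0.15 gives a normalized entropy of about 0.61.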
This experiment was performed on the same platform as that in Section 4.2. DLS-MC samples the target cliques, Hairy and Bald, of this instance non-uniformly, with relative frequencies of 0.85 and 0.15, respectively. The average time required to sample each target clique was 0.0001 CPU seconds. We added a string of 10 edges to the vertex that is located exactly in the center of the bridge between Hairy and Bald to create a new instance, and attached, in two stages, 200 vertices to the vertex at the tip of this branch (see Figure 5.6). After adding 100 vertices to the tip of the branch, the relative frequencies of these two target cliques stayed almost unchanged (0.88 and 0.12), but the average time required to sample each solution increased to 0.0024 CPU seconds. When we attached another 100 vertices to the vertex at the tip of the branch, the change in the relative frequencies was again negligible (0.853 and 0.147) and the average time to sample each target clique increased again (0.01283 CPU seconds).

5.1.4 Modifying a DIMACS MAX-CLIQUE instance, p hat700-1, to test Hypothesis 1

We also made some modifications to one of the DIMACS MAX-CLIQUE instances, p hat700-1, in order to test Hypothesis 1 described in Section 5.1. Instance p hat700-1 has 700 vertices and 60 999 edges. The target (optimal) clique size of p hat700-1 is 11, and it has 2 cliques of this size. The experiments of this section were performed on the same platform as that in Section 4.2. DLS-MC samples the two cliques of size 11 of p hat700-1 uniformly (with a normalized entropy of 1, as shown in Table 4.1). We manually added cliques of size 10 (i.e., target size minus 1) to only one of the two target cliques of p hat700-1. In order to add a clique of size 10 to a clique of size 11, a new vertex is added to the graph and then this vertex is connected to 9 of the vertices of the clique of size 11. It is interesting to note that the two target cliques of instance p hat700-1 differ only by a single vertex.
Therefore, connecting a new vertex to 9 of the vertices of one of the two target cliques connects this vertex to 8 vertices of the other target clique as well. Seven cliques of size 10 were added one by one in 7 different experiments.

Figure 5.6: Impact of cliques of sizes smaller than the target size that are not connected to any of the target cliques on the amount of uniformity of sampled target cliques. There are 60 vertices connected to Hairy, and there are 200 vertices connected to the vertex with the label t.

Figure 5.7: Impact of increasing the number of cliques of sizes smaller than the target cliques connected to one of the target cliques on the amount of uniformity of the sampled solutions. The pink line shows the trend of changes in the relative frequency of the target clique as cliques of size 10 are being attached to it. The grey line shows the corresponding changes in the relative frequency of the other clique of size 11 as cliques of size 10 are being attached to the other clique.

Figure 5.7 shows that as we increase the number of cliques of size 10 that are connected to only one of the two cliques of size 11 of p hat700-1, DLS-MC samples that target clique more often (pink line). This shows that as the sum of the relative frequencies of the cliques around one of the two target cliques increases (by increasing their number), the probability of sampling that target clique increases.

5.2 Sampling uniformly and pd

The pd parameter of DLS-MC specifies how often all of the non-zero penalty values are decreased by 1 to avoid high penalty values and allow DLS-MC to 'forget' penalty values over time (see Section 4.1).
Pullan and Hoos showed that pd can be adjusted for a faster search by DLS-MC [2]. They also showed that pd provides a mechanism for focusing on lower-degree vertices [2]. This section examines the effect of pd on the sampling uniformity achieved by DLS-MC. The penalty values are updated if the current clique cannot be extended to a larger clique or replaced with a clique of the same size (i.e., no iterative improvement or plateau search is possible); in that case, the penalty values of the vertices of the current clique are increased (see Section 4.1 for more details). Any other cliques to which those vertices belong may then also be sampled less often. A pd value > 1 leads to a reduction of the penalty values every pd update cycles; for example, every third update cycle when the pd parameter of DLS-MC is 3. This mechanism can provide a way of preventing, for some number of update cycles, the reconstruction of the target cliques that are sampled frequently. For graphs such as the one depicted in Figure 5.1, a pd value > 1 may improve the amount of uniformity compared to when a pd value of 1 is used; for an example, see the sampling result of the instance gen400 p0.9 75 with a pd value of 5 in Table 5.4 and compare it with its sampling result from Table 4.1, where pd equals 1. On the other hand, in some cases a better sampling performance requires the value of pd to be equal to 1 (i.e., no penalization), as in these cases increasing the penalty values associated with the vertices of some cliques in a set of uniformly sampled cliques causes DLS-MC to become biased towards some of the cliques whose penalty values are lower (for an example, see instance brock400-3 in Table 5.4 and compare it with the result of Table 4.1, shown in parentheses in Table 5.4). We believe that if the target cliques of a graph are sampled non-uniformly by DLS-MC with pd = 1, then with pd values > 1 DLS-MC may give the less frequently sampled target cliques a chance to be sampled more often.
On the other hand, if the target cliques of a graph are sampled non-uniformly by DLS-MC with a pd value > 1, using pd = 1 may eliminate this bias. A good value of pd for sampling the target cliques of a graph near-uniformly may be estimated after m independent runs of DLS-MC. The value of m depends on the number of target cliques in the graph, and could be much smaller than the number of runs that would be required to get a good approximation of the distribution of the sampled target cliques using the chosen pd value. Depending on the graph structure, different pd values may affect the amount of uniformity. For graphs in which the sums of the sampling relative frequencies of the groups of cliques of sizes smaller than the target size that are connected to the target cliques are almost equal, a pd value equal to 1 may be best. For other types of graphs, it may be best to try pd values > 1 if possible. If a set of k-CLIQUE graphs is available with information about their structures, it is possible to determine whether a pd value of 1 can be used for near-uniform sampling of their target cliques without further search for an appropriate value of pd. We first examine the effect of different pd values on the sampling performance of DLS-MC for the instances from Subsections 5.1.2 and 5.1.4. The first instance, from Subsection 5.1.2, is the one with 60 hairs and two target cliques of size 3 (see Figure 5.4), and the instance from Subsection 5.1.4 is p hat700-1 with 7 new cliques of size 10 attached to one of its two target cliques of size 11. The experiments of this section are performed on the same platform as that in Section 4.2. DLS-MC with pd values from 2 to 10 did not provide any significant improvement in the uniformity of the sampled solutions with respect to any of the 3 measures of uniformity (see Tables 5.2 and 5.3).
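Our reading of the penalty bookkeeping that pd controls can be sketched as follows (a sketch based on the description above and in Section 4.1, not the DLS-MC source code):

```python
# Sketch of pd-controlled penalty bookkeeping, based on our reading of the
# description above (not the DLS-MC source): the vertices of the current
# clique are penalized at each update cycle, and every pd-th cycle all
# non-zero penalties are decreased by 1. With pd = 1 the decrement cancels
# the increment, which corresponds to "no penalization" in the text.

class PenaltyTable:
    def __init__(self, n_vertices, pd):
        self.penalty = [0] * n_vertices
        self.pd = pd
        self.cycles = 0

    def update(self, current_clique):
        # Called when neither improvement nor plateau search is possible.
        for v in current_clique:
            self.penalty[v] += 1
        self.cycles += 1
        if self.cycles % self.pd == 0:
            self.penalty = [max(0, p - 1) for p in self.penalty]
```

Larger pd values let penalties accumulate for longer before decaying, which is what delays the reconstruction of frequently sampled cliques.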
The number of times a target clique of the instance with 60 hairs is sampled is 1 000, and the number of times a target clique of the other instance is sampled is 50 000. The reason the instance from Subsection 5.1.2 is not sampled more uniformly with pd values > 1 is that, after each hair is constructed, one of the edges of the target clique attached to the hair is reached using a single plateau step, and therefore in practice there is no adjustment of penalty values for this instance. Similarly, after constructing one of the edges of the bridge between the two target cliques, one of the edges of the target cliques is reached by performing a series of plateau steps, and therefore again no adjustment of penalty values is possible. The reason the modified instance p hat700-1 from Subsection 5.1.4 is not sampled more uniformly with pd values > 1 is that its two target cliques share all of their vertices except one, and consequently increasing the penalty values of the vertices of the more frequently sampled target clique also increases the penalty values of the vertices of the other clique.

pd  EMPN/E(n)    normalized entropy  INT10  CPU(s)/solution
1   3.167±2.500  0.529               1.000  0.000
2   6.547±4.506  0.491               1.000  0.000
3   5.180±2.276  0.472               1.000  0.000
4   2.057±0.291  0.430               1.000  0.000
5   2.813±0.640  0.526               1.000  0.000
6   1.817±0.756  0.456               1.000  0.000
7   1.347±0.333  0.491               1.000  0.000
8   5.970±3.176  0.512               1.000  0.000
9   8.933±3.083  0.453               1.000  0.000
10  3.563±1.085  0.485               1.000  0.000

Table 5.2: Sampling performance of DLS-MC with various pd values for the instance with 60 hairs from Subsection 5.1.2.

In order to see the effect of different pd values on the amount of uniformity, we also used the 12 DIMACS MAX-CLIQUE instances that were
marked as the instances whose target cliques were sampled non-uniformly in Table 4.1. The total number of times a target clique is sampled by DLS-MC is 50 000. For each of these instances, we tried the pd values 1, 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50 for 500 runs of DLS-MC, and picked the pd value that improved the normalized entropy the most. This value of pd can then be used for further near-uniform sampling of the instance. We also calculated the value of EMPN/E(n) 10 times for each of these 12 instances with these different pd values to find some of the promising pd values. For the pd values that we suspected would result in better sampling with respect to EMPN/E(n), we calculated the value of EMPN/E(n) 100 times and took the average. It is possible for a pd value to improve the amount of uniformity with respect to normalized entropy, but decrease the amount of uniformity with respect to EMPN/E(n). For example, the value of normalized entropy for the instance brock200-2 with pd = 2 is 0.485 and with pd = 40 is 0.613, but the value of EMPN/E(n) with pd = 2 is 6.856 and with pd = 40 equals 15.943. If the time required for DLS-MC to sample a target clique with one of these pd values was more than 1 CPU second, we chose not to use that pd value, as a large number of samples (50 000) had to be drawn.

pd  EMPN/E(n)    normalized entropy  INT10  CPU(s)/solution
1   1.213±0.788  0.745               0.000  0.015
2   1.740±1.240  0.741               0.000  0.015
3   1.233±1.086  0.740               0.000  0.015
4   1.707±2.021  0.740               0.000  0.015
5   1.297±0.975  0.729               0.000  0.015
6   1.520±1.501  0.733               0.000  0.015
7   1.737±1.326  0.731               0.000  0.015
8   1.453±0.808  0.728               0.000  0.015
9   1.520±1.138  0.715               0.000  0.016
10  1.303±0.827  0.718               0.000  0.016

Table 5.3: Sampling performance of DLS-MC with various pd values for the modified p hat700-1 instance with 7 new cliques of size 10 from Subsection 5.1.4.
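The pd-selection protocol described above can be sketched as follows; `sample_clique` is a hypothetical stand-in for one independent run of DLS-MC that returns an identifier of the target clique it found:

```python
import math
from collections import Counter

# Sketch of the pd-selection protocol described in the text: for each
# candidate pd, draw a modest number of samples (500 runs, as in the text)
# and keep the pd whose empirical distribution over sampled cliques has
# the highest normalized entropy. sample_clique(pd) is a hypothetical
# stand-in for one run of DLS-MC with that pd value.

def best_pd(sample_clique,
            candidates=(1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50),
            runs=500):
    def norm_entropy(counts):
        total = sum(counts)
        h = -sum((c / total) * math.log2(c / total) for c in counts if c)
        # Normalize by the number of distinct cliques actually observed.
        return h / math.log2(len(counts)) if len(counts) > 1 else 0.0
    scores = {}
    for pd in candidates:
        freq = Counter(sample_clique(pd) for _ in range(runs))
        scores[pd] = norm_entropy(list(freq.values()))
    return max(scores, key=scores.get)
```

In practice one would also discard any candidate pd whose per-sample time exceeds the budget, as described in the text.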
All of the results in this section are compared with the results of Table 4.1, which presents the sampling performance of DLS-MC with the original pd values taken from [2] on the 44 DIMACS MAX-CLIQUE instances used in this thesis. The result of Table 4.1 for each of these 12 instances is shown in parentheses next to each result in Table 5.4. The target cliques of instance brock200-1 are sampled much more uniformly than when pd was 2: the value of normalized entropy went from 0.725 to 0.996, and the value of INT10 went to 1 (no previous value was recorded). The value of EMPN/E(n) stayed almost as good as before. The distribution of the target cliques of instance brock200-2 induced by DLS-MC with pd = 40 is more uniform than before with respect to two of the measurements: the value of normalized entropy went from 0.485 to 0.613 and the value of INT10 improved to 0.500 (no previous value was recorded); however, its EMPN/E(n) went from 6.856 to 15.943 and its standard deviation increased as well. The reason instance brock200-2 is sampled with more bias with respect to one of the measures of uniformity could be that EMPN/E(n) measures a property of DLS-MC that is not necessarily compatible with normalized entropy (see Section 1.2). For the instance brock200-3, we again have an improvement with respect to normalized entropy and INT10: the value of normalized entropy went from 0.753 to 0.876 and the value of INT10 went from 0.417 to 0.875. Its target cliques are sampled with more bias with respect to EMPN/E(n); its value of EMPN/E(n) went from 6.528 to 8.140. Instance brock200-4 was also sampled more uniformly with respect to two of the measurements using a pd value of 30: the value of normalized entropy went from 0.784 to 0.821 and the value of INT10 went from 0.433 to 0.633; however, the value of EMPN/E(n) went from 11.934 to 20.366. The time required to sample each target clique was much larger than before (it went from 0.003 to 0.774 CPU seconds).
Instance brock400 3 was sampled much more uniformly with respect to two of the measurements: its normalized entropy went from 0.687 to 0.998 and its value of INT10 went from 0.645 to 1, while the value of EMPN/E(n) stayed close to 1. The time required on average to sample a target clique increased from 0.275 to 0.736 CPU seconds. The instance gen400 p0.9 75 is sampled more uniformly with respect to EMPN/E(n) and INT10: its value of EMPN/E(n) went from 7.000 to 3.929 and the value of INT10 went from 0.974 to 1. Its normalized entropy also improved (from 0.847 to 0.860). Instance san200 0.9 2 was also sampled more uniformly with respect to two of the measurements: its normalized entropy went from 0.875 to 0.897 and its value of INT10 went from 0.984 to 1, although these improvements are not large enough to be considered significant. Its value of EMPN/E(n) stayed close to 2. Instance san200 0.9 3 was sampled more uniformly with respect to EMPN/E(n): its value went from 6.011 to 5.137, while it was sampled as uniformly as before with respect to the other measurements of uniformity. Instance sanr200 0.9 was sampled more uniformly with respect to two of the measurements: its value of EMPN/E(n) went from 2.840 to 1.317 and its normalized entropy went from 0.895 to 0.966, while the value of INT10 stayed at 1. We could not find a value of the pd parameter that significantly improves the amount of uniformity with respect to any of the measurements for the following instances: c-fat200-2, gen400 p0.9 65, san200 0.9 3 and sanr400 0.7. We noted that the value of normalized entropy as a function of pd is unimodal (ignoring small perturbations that are due to randomization) for all the DIMACS MAX-CLIQUE instances presented in Table 4.1; that is, after the entropy starts to decrease, it does not increase any more (see Figures 5.8, 5.9 and 5.10).
When the pd value becomes large, the penalty values do not decrease very often; therefore, the number of cliques that cannot be sampled by DLS-MC increases, which leads to a decrease in entropy.

   Instance        Target size  pd      # of solutions  EMPN/E(n)                     normalized entropy  INT10          CPU(s)/solution
1  brock200_1      21           5 (2)   2               0.893±0.329 (1.153±0.893)     0.996 (0.725)       1.000 ( )      0.041 (0.027)
2  brock200_2      11           40 (2)  14              15.943±9.778 (6.856±2.374)    0.613 (0.485)       0.500 ( )      0.006 (0.005)
3  brock200_3      14           30 (2)  24              8.140±6.411 (6.528±4.710)     0.876 (0.753)       0.875 (0.417)  0.003 (0.004)
4  brock200_4      16           30 (2)  30              20.366±12.784 (11.934±3.636)  0.821 (0.784)       0.633 (0.433)  0.774 (0.003)
5  brock400_3      30           1 (15)  31              1.020±0.296 (1.028±0.335)     0.998 (0.687)       1.000 (0.645)  0.736 (0.269)
6  c-fat200-2      23           25 (1)  26              2.394±0.772 (2.374±0.536)     0.706 (0.671)       0.923 (0.923)  ( 0.001)
7  gen400_p0.9_65  64           5 (1)   71              17.537±17.595 (8.961±4.138)   0.811 (0.775)       0.944 (0.915)  0.008 (0.004)
8  gen400_p0.9_75  74           5 (1)   77              3.929±1.821 (7.000±3.734)     0.860 (0.847)       1.000 (0.974)  0.002 (0.001)
9  san200_0.9_2    59           15 (2)  61              2.136±0.772 (1.994±0.709)     0.897 (0.875)       1.000 (0.984)  ( )
10 san200_0.9_3    43           5 (2)   50              5.137±2.341 (6.011±3.235)     0.803 (0.797)       0.980 (0.960)  0.002 (0.002)
11 sanr200_0.9     42           1 (2)   21              1.317±0.529 (2.840±1.355)     0.966 (0.895)       1.000 (1.000)  0.012 (0.022)
12 sanr400_0.7     21           1 (2)   140             3.766±1.932 (4.002±1.692)     0.955 (0.955)       1.000 (1.000)  0.034 (0.037)

Table 5.4: Sampling performance of DLS-MC with the best pd value with respect to normalized entropy and EMPN/E(n), chosen from the pd values 1, 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50. The values in parentheses are the results of sampling the solutions taken from Table 4.1, with a pd value that minimizes the time required to sample each target clique.

Three different cases occur as the value of pd changes: 1.
The normalized entropy increases to reach a global maximum and then starts to decrease as the value of pd continues to increase (see Figure 5.8 for an example). 2. The normalized entropy decreases as the pd value increases (see Figure 5.9 for an example). 3. The normalized entropy does not change significantly after some pd value (see Figure 5.10 for an example). Overall, DLS-MC with the chosen pd values other than the ones from Table 4.1 sampled 61% of these 12 instances more uniformly with respect to some measures of uniformity, emphasized in bold in Table 5.4. The time required for this approach increased noticeably only for one of the instances (258 times); for the rest of the instances, the change was negligible (less than 2.8 times), if any. This approach of adjusting the pd value does not always succeed in sampling near-uniformly at low cost, since if we want to penalize the vertices of a frequently sampled target clique to impede DLS-MC from reconstructing it again, the vertices of the less frequently sampled target cliques that share many of their vertices with this target clique are also penalized.

Figure 5.8: The effect of changing the value of pd on the amount of entropy of the sampled solutions for the instance brock200_1.

5.3 Sampling uniformly and the use of a Monte-Carlo procedure integrated with DLS-MC

Some graphs have target cliques which are highly connected to each other. For these types of graphs, using DLS-MC with a pd value > 1 decreases the chance of sampling the cliques that are connected to a vertex with a penalty value higher than the penalty values of some other vertices.
We need a method of sampling these connected cliques just before the penalty values are increased, in order to have a chance to sample the cliques from other groups of target cliques after penalizing the vertices of this group of solutions.

Figure 5.9: The effect of changing the value of pd on the amount of entropy of the sampled solutions for the instance brock400_4.

Figure 5.10: The effect of changing the value of pd on the amount of entropy of the sampled solutions for the instance brock200_3.

As we can see from the following description of the Monte-Carlo procedure, a target clique may not be reachable in a reasonable number of steps using either low or high temperatures; however, with a low temperature, if a target clique is given, the procedure can search the neighborhood of this target clique efficiently. The sampling method proposed in this section is inspired by SampleSat [31]. We first let DLS-MC find a target clique, and then a Monte-Carlo method with a low temperature is used to search the neighborhood of this clique. The Monte-Carlo procedure is described in detail as follows. • It first constructs a set of all the vertices that are connected to at least 10% of the vertices of the target clique sampled by DLS-MC. This set is constructed only once and is updated whenever the current clique changes. The value of 10 is somewhat arbitrary. • Then a vertex is chosen uniformly at random from that set. – If the vertex is connected to all of the vertices of the current clique of size s, it will be added to the clique to construct a clique of size s + 1. (This step is possible only if some other Monte-Carlo step(s) has/have already been performed.) – If the vertex is connected to s − 1 of the vertices of the current clique, the only vertex of the current clique that is not adjacent to the new vertex is replaced by the new vertex, building a new clique of size s.
(This step is also possible only if some other Monte-Carlo step(s) has/have already been performed.) – Otherwise (i.e., adding the vertex would reduce the current clique to a smaller clique), the vertex is added to the current clique c with the following probability: Paccept(T, c, c′) := exp((f(c′) − f(c))/T), where T is a constant, called the temperature, provided by the user, c is the old clique, c′ is the new clique, and the function f returns the size of a clique. Using a low temperature, the current clique will, with high probability, be reduced to one of the cliques that have a large amount of overlap with it. • At this point, any vertex in the graph that is not connected to the new clique is removed from the set of vertices constructed at the beginning of the Monte-Carlo method, and any vertex connected to the newly added vertex that is not already in that set is added to the set.

   Instance        Target size  Temperature  # of solutions  EMPN/E(n)                    normalized entropy  INT10          CPU(s)/run
1  brock200_1      21           2            2               1.043±0.485 (1.153±0.893)    0.765 (0.725)       ( )            0.027 (0.027)
2  brock200_2      11           10           14              6.787±2.665 (6.856±2.374)    0.483 (0.485)       ( )            0.014 (0.005)
3  brock200_3      14           4            24              6.193±2.435 (6.528±4.710)    0.757 (0.753)       0.375 (0.417)  0.011 (0.004)
4  brock200_4      16           7            30              12.047±3.964 (11.934±3.636)  0.784 (0.784)       0.433 (0.433)  0.010 (0.003)
5  brock400_3      30           3            31              4.217±1.706 (1.028±0.335)    0.730 (0.687)       0.935 (0.645)  0.271 (0.269)
6  c-fat200-2      23           3            26              1.491±0.522 (2.374±0.536)    0.727 (0.671)       0.923 (0.923)  0.004 (0.001)
7  gen400_p0.9_65  64           10           71              5.393±3.074 (8.961±4.138)    0.844 (0.775)       0.972 (0.915)  0.008 (0.004)
8  gen400_p0.9_75  74           10           77              2.492±1.009 (7.000±3.734)    0.830 (0.847)       1.000 (0.974)  0.003 (0.001)
9  san200_0.9_2    59           9            61              0.940±0.282 (1.994±0.709)    0.884 (0.875)       1.000 (0.984)  0.001 ( )
10 san200_0.9_3    43           7            50              1.142±0.455 (6.011±3.235)    0.880 (0.797)       1.000 (0.960)  0.003 (0.002)
11 sanr200_0.9     42           7            21              0.168±0.083 (2.840±1.355)    0.994 (0.895)       1.000 (1.000)  0.022 (0.022)
12 sanr400_0.7     21           2            140             0.440±0.122 (4.002±1.692)    0.994 (0.955)       1.000 (1.000)  0.022 (0.037)

The '# of selections' values reported were 14 091, 11 857, 21 802, 30 508, 74 758, 15 735 and 8 475.

Table 5.5: Sampling performance of DLS-MC integrated with a Monte-Carlo procedure. The '# of selections' is the average number of selections reported in [2], required for DLS-MC with the pd values in [2] to find a target clique.

An experiment to investigate the sampling performance of this method of integrating a Monte-Carlo procedure with DLS-MC is as follows. The instances used for this experiment are the ones marked as being sampled non-uniformly in Table 4.1. Table 4.1 presents the sampling performance of DLS-MC with the default pd values taken from [2] on some of the DIMACS MAX-CLIQUE instances. The temperature is a constant that needs to be specified as a parameter by the user. For the value of the temperature, we tried the values 1, 2, ..., 10 for each of the instances and reported the one that improved the value of normalized entropy the most. This value of the temperature can be used for further near-uniform sampling of the instance. The total number of times a target clique is sampled by DLS-MC is 50 000, and the number of times a Monte-Carlo step is performed is 1 000. As shown in the column '# of selections' of Table 5.5, DLS-MC with the original pd values requires far more than 1 000 selections to sample a target clique for most of these instances. The times reported in Table 5.5 are the times required for the 1 000 Monte-Carlo steps and a run of DLS-MC. All of the results in this section are compared with the results of Table 4.1. Perhaps EMPN/E(n) is the best measurement to show the sampling power of this method, as many of the cliques in a neighborhood can be sampled by the Monte-Carlo procedure in a single run. The empirical result presented in Table 5.5 shows that most of the instances are sampled more uniformly using this method of integrating a
Monte-Carlo procedure with DLS-MC when compared to the result of DLS-MC with the original pd values. The method introduced in this section, which integrates a Monte-Carlo procedure with DLS-MC, did not provide any improvement to the uniformity of the sampled target cliques for the instances of the brock family, except for the instance brock400 3 with respect to two of the measurements: the normalized entropy of brock400 3 went from 0.687 to 0.730 and its value of INT10 went from 0.645 to 0.935. As already stated, the result of this method is compared with the result of DLS-MC with the original pd values in Table 4.1. As the two target cliques of brock200-1 are not connected to each other, this method cannot provide any improvement to the distribution of the sampled target cliques. The instance brock200-2 has a similar situation: it has two groups of solutions that are connected to each other by 9% of their vertices (for further details, see Subsection 5.1.1). Here we examine two more instances from the brock family of DIMACS MAX-CLIQUE graphs that were not sampled more uniformly using the method that integrates a Monte-Carlo procedure with DLS-MC than when they were sampled using DLS-MC with the original pd values. These two instances are brock200-3 and brock200-4. Table 5.6 contains a 24×24 matrix for the instance brock200-3, which has 24 target cliques. Each entry xi,j of this matrix (i.e., the entry in the ith row and the jth column) represents the number of vertices common to clique i and clique j of instance brock200-3, divided by the target clique size (14). The target cliques appear to form four groups: cliques 1-7 form the first group, clique 8 forms a group by itself, cliques 9-11 and 13-24 form the third group, and clique 12 also forms a group by itself. We can see from the matrix in Table 5.7 that the target cliques of instance brock200-4 can also be partitioned into four groups.
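Overlap matrices like those in Tables 5.6 and 5.7, and a grouping of the target cliques by overlap, can be computed as sketched below; the grouping rule (greedily merging cliques that share at least half their vertices with a group's seed clique) is our own illustrative choice, not the thesis's procedure:

```python
# Sketch of the overlap matrix of Tables 5.6 and 5.7: entry (i, j) is the
# number of vertices shared by cliques i and j divided by the target size.
# group_cliques then groups cliques greedily by overlap with a seed clique;
# the 0.5 threshold and the greedy rule are our own illustrative choices.

def overlap_matrix(cliques, target_size):
    n = len(cliques)
    return [[len(cliques[i] & cliques[j]) / target_size for j in range(n)]
            for i in range(n)]

def group_cliques(cliques, target_size, threshold=0.5):
    m = overlap_matrix(cliques, target_size)
    groups, assigned = [], [False] * len(cliques)
    for i in range(len(cliques)):
        if assigned[i]:
            continue
        group, assigned[i] = [i], True
        for j in range(i + 1, len(cliques)):
            if not assigned[j] and m[i][j] >= threshold:
                group.append(j)
                assigned[j] = True
        groups.append(group)
    return groups
```

Applied to the target cliques of brock200-3 or brock200-4, a grouping of this kind recovers block structure like the bold partitions described in the text.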
Table 5.7 is a 30×30 matrix in which each entry x_{i,j} represents the number of vertices shared by clique i and clique j of the instance brock200-4, divided by the target size (16). The four groups that partition the target cliques of brock200-4 are shown in bold in Table 5.7: clique 1 forms a group by itself, cliques 2-5 form the second group, cliques 6-13 form the third group, and cliques 14-30 form the fourth group. If a clique from any of these groups is sampled by DLS-MC, it is very likely that the Monte-Carlo procedure samples other cliques of the same group. This method of sampling cannot change the probability of reaching a group of target cliques, since reaching a group is still done by DLS-MC in the same manner as before.

[Table 5.6: A matrix that represents the amount of connectivity among the 24 solutions of brock200-3. Each row and column represents one of the cliques; entries within a group of overlapping cliques are large (about 0.93) and entries across groups are small (at most 0.07). The full 24×24 matrix is omitted here.]

[Table 5.7: A matrix that represents the amount of connectivity among the 30 solutions of brock200-4. Each row and column represents one of the cliques; entries within each of the four groups are large and entries across groups are small. The full 30×30 matrix is omitted here.]

In this method, as already explained, we first use DLS-MC with the same pd values as in Table 4.1 to find a target clique, and then the Monte-Carlo procedure is used to sample the target cliques around the clique discovered by DLS-MC. Therefore, the probability of finding a clique at the beginning, and consequently of reaching a particular group of solutions, is the same as before. However, since we apply the Monte-Carlo procedure right after a target clique is found, even when that clique belongs to a group that DLS-MC samples infrequently, we can hope to sample most of the cliques of that group. This helps to decrease the time required to find all the target cliques.

This method of integrating a Monte-Carlo procedure with DLS-MC sampled the target cliques of instance c-fat200-2 more uniformly with respect to two of the uniformity measurements: its EMPN/E(n) went from 2.374 to 1.491 and its normalized entropy went from 0.671 to 0.727. The values of the other two measurements stayed almost unchanged. Instance gen400_p0.9_65 was sampled more uniformly with respect to all of the measurements.
Its value of EMPN/E(n) went from 8.961 to 5.393, its normalized entropy went from 0.775 to 0.844 and its INT10 went from 0.915 to 0.972. Instance gen400_p0.9_75 was sampled more uniformly with respect to EMPN/E(n) and INT10: the value of EMPN/E(n) went from 7.000 to 2.492, and the value of INT10 went from 0.974 to 1. Instance san200_0.9_2 was also sampled more uniformly with respect to EMPN/E(n) and INT10: the value of EMPN/E(n) went from 1.994 to 0.94, the value of INT10 went from 0.984 to 1, and its normalized entropy stayed almost unchanged. Instance san200_0.9_3 was sampled more uniformly with respect to all of the measurements: the value of EMPN/E(n) went from 6.011 to 1.142, its normalized entropy went from 0.797 to 0.88 and the value of INT10 went from 0.96 to 1. Instance sanr200_0.9 was also sampled more uniformly with respect to two of the measurements: the value of EMPN/E(n) went from 2.84 to 0.168 and its normalized entropy went from 0.895 to 0.994; the value of INT10 stayed at 1. This method also sampled instance sanr400_0.7 more uniformly with respect to two of the measurements: the value of EMPN/E(n) went from 4.002 to 0.44 and the value of normalized entropy went from 0.955 to 0.994; the value of INT10 stayed at 1. The time required for each run of this method that integrates a Monte-Carlo procedure with DLS-MC did not increase by more than 0.014 CPU seconds over the corresponding times reported in Table 4.1.

pd   EMPN/E(n)                   normalized entropy   INT10           CPU(s)/run
1    1.300±2.008 (1.347±0.333)   0.964 (0.491)        1.000 (1.000)   0.003 ( )
1    1.300±2.008 (3.167±2.500)   0.964 (0.529)        1.000 (1.000)   0.003 ( )

Table 5.8: Sampling performance of the method that integrates a Monte-Carlo procedure with DLS-MC with temperature 1 for the instance with 60 hairs from Subsection 5.1.2. The best result of Table 5.2 with respect to EMPN/E(n) is shown within parentheses on the first row.
The best result of Table 5.2 with respect to normalized entropy is shown within parentheses on the second row. Table 5.2 reports the sampling performance of DLS-MC with various pd values on the same instance. Within each run multiple solutions may be sampled.

The instances c-fat200-2, gen400_p0.9_75, san200_0.9_3, sanr200_0.9 and sanr400_0.7 were sampled more uniformly by the method of Section 5.2 (i.e., adjusting the pd values) than by the method that integrates a Monte-Carlo procedure with DLS-MC presented in Section 5.3. Instance sanr200_0.9 was sampled near-uniformly by both of these methods.

We also tried this method on the instances from Subsections 5.1.2 and 5.1.4. The experiments for these instances were performed on the same platform as that in Section 4.2. The sampling performance of the method that integrates a Monte-Carlo procedure with DLS-MC for the instance with two target cliques of size 3 and 60 hairs attached to one of its target cliques from Subsection 5.1.2, based on 1 000 samples with temperature 1, is presented in Table 5.8. For each instance, the value of temperature that optimizes the value of normalized entropy is chosen from 1 to 10. As shown in Table 5.8, using this method we are able to sample the solutions of this instance uniformly. The reason is that after either target clique is discovered using DLS-MC, the other target clique is also sampled by taking a series of Monte-Carlo steps: the two target cliques are in a neighborhood only 10 edges apart (each edge is basically a clique of size 2). The modified p_hat700-1 instance with 7 new added cliques of size 10 from Subsection 5.1.4 was sampled 50 000 times using this method. As shown in Table 5.9, the two target cliques of this instance are sampled uniformly when a temperature of 0 is used. Monte-Carlo steps with temperature 0 become a series of plateau steps.
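The role of the temperature parameter in these Monte-Carlo steps can be illustrated with a Metropolis-style acceptance rule. This is a minimal sketch under the assumption that each step changes an objective such as the current clique size by `delta`; the exact rule used by our procedure may differ in detail:

```python
import math
import random

def accept_step(delta: float, temperature: float) -> bool:
    """Metropolis-style acceptance for a proposed Monte-Carlo step.

    delta: change in the objective (e.g. clique size) caused by the step.
    At temperature 0 only plateau or improving steps are accepted, so the
    walk degenerates into a series of plateau steps.
    """
    if delta >= 0:           # plateau or improving step: always accept
        return True
    if temperature == 0:     # temperature 0: reject every worsening step
        return False
    return random.random() < math.exp(delta / temperature)
```

With temperature 1, for example, a step that worsens the objective by one (delta = -1) is still accepted with probability exp(-1) ≈ 0.37, which is what allows the walk to cross small bridges between overlapping target cliques.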
Temperature   pd   EMPN/E(n)                   normalized entropy   INT10       CPU(s)/run
0             1    1.307±1.062 (1.213±0.788)   0.990 (0.745)        1.000 ( )   0.016 (0.015)
1             1    1.090±0.444 (1.213±0.788)   0.828 (0.745)        0.500 ( )   0.032 (0.015)

Table 5.9: Sampling performance of the method that integrates a Monte-Carlo procedure with DLS-MC for the modified p_hat700-1 instance with 7 new cliques of size 10 from Subsection 5.1.4. The best result of Table 5.3 is shown within parentheses. Table 5.3 reports the sampling performance of DLS-MC with various pd values on the same instance.

This amount of uniformity with temperature equal to 0 is expected, as the target cliques differ by a single vertex. The result of sampling the target cliques of this instance with temperature equal to 1 is also reported in Table 5.9.

An algorithm can be designed that embeds both the method that adjusts the value of pd and this method of taking Monte-Carlo steps after a search by DLS-MC, in order to sample the target cliques of a given graph more uniformly. We believe such a method would sample more uniformly because, after adjusting the value of pd, target cliques from different groups of solutions can be sampled; once a target clique from a group is sampled using DLS-MC with an adjusted value, the Monte-Carlo procedure can be used to sample the neighboring target cliques in that group.

5.4 Searching for other methods of improvement

We also tried some other methods for solving S-CLIQUE, but they failed to provide any improvement for the instances used in this thesis. A brief outline of these methods is given here. One of them was again based on DLS-MC. As shown in Table 4.1, DLS-MC has the capability to sample a target clique very quickly. However, the target clique that it finds can be the same one in many runs. This could be because the penalty values of these frequently sampled target cliques are not high enough before a clique is discovered.
DLS-MC reinitializes the penalty values of all the vertices of the graph to zero after it finds a target clique (i.e., after a run); therefore, when DLS-MC starts a new run, no mechanism exists to reduce the chance of reconstructing the target cliques that were constructed in previous runs. We investigated carrying over the penalty values of the vertices from previous runs in order to reduce the chance of reconstructing some of the old cliques. This way we hoped to keep the vertices in the regions with frequently sampled target cliques at high penalty values, in order to give DLS-MC a chance to search for the less frequently sampled target cliques. After some pd cycles the vertices from the neighborhoods of frequently sampled target cliques would again have a chance to be part of a target clique. The problem with this method was that some of the target cliques of some instances would not be discovered at all, possibly because some of the vertices of these cliques overlapped with the sampled target cliques and maintained their high penalty values for a long time.

We also tried a pre-processing step to reduce the number of edges of the given graphs before any sampling is performed. Using this pre-processing procedure we hoped to eliminate the cliques of sizes smaller than the target size around the target cliques. An adjacency matrix is a representation of a graph such that each of its elements x_{i,j} is 1 if there is an edge between vertex i and vertex j of the graph, and 0 otherwise. As part of this pre-processing we add a self-edge (i.e., a loop) to each vertex of the graph, which corresponds to adding 1s on the main diagonal of the adjacency matrix. We will explain the reason for the addition of these loops later. Let k be the target size of cliques in the given graph.
A property of adjacency matrices with 1s on their main diagonals is that if they are raised to the power k, each element x_{i,j} of the resulting matrix represents the number of paths of size less than or equal to k between vertex i and vertex j. If the number of paths of size at most k between vertex i and vertex j is less than k^(k-1), then we can prune the edge between vertices i and j, because in a clique of size k there are at least k^(k-1) paths of size at most k between each pair of its vertices. We can repeat the same procedure on the adjacency matrices of the new graphs obtained from pruning until no more pruning is possible. A path from vertex i to vertex j is a list of vertices starting with i and ending with j, and a path of size k has k edges. As we added loops to all of the vertices, each path of any size less than k can be seen as a path of size k by taking the loops multiple times. A path of size k consists of k + 1 vertices, but since in a path from vertex i to vertex j the first and the last vertices are fixed, there are k - 1 vertices of the path that can each be any of the k vertices of a clique of size k (as we allow loops to be part of a path). Therefore, there are at least k^(k-1) possible paths between any pair of vertices of a clique of size k.

For example, consider the graph of Figure 5.11. Its adjacency matrix is represented in Table 5.10, and this matrix raised to the power 3 is shown in Table 5.11. Each element x_{i,j} of the matrix shown in Table 5.11 that has a value less than 3^(3-1) = 9 means that the edge between vertex i and vertex j cannot be part of any clique of size 3, leaving us with the pruned graph shown in Figure 5.12. Unfortunately, however, this method did not provide a noticeable reduction in the number of edges of DIMACS MAX-CLIQUE graphs.
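The pruning rule just described can be sketched with NumPy; `prune_edges` is an illustrative helper (not code from the thesis) that repeats the matrix-power test until no more edges can be removed:

```python
import numpy as np

def prune_edges(adj: np.ndarray, k: int) -> np.ndarray:
    """Remove edges that cannot belong to any clique of size k.

    adj: symmetric 0/1 adjacency matrix without self-loops.
    An edge (i, j) survives only if the loop-augmented matrix raised to
    the power k counts at least k^(k-1) paths of size <= k from i to j.
    """
    a = adj.astype(np.int64).copy()
    threshold = k ** (k - 1)  # minimum path count inside a k-clique
    while True:
        m = a + np.eye(len(a), dtype=np.int64)   # add self-loops
        paths = np.linalg.matrix_power(m, k)     # counts paths of size <= k
        pruned = a * (paths >= threshold)        # drop edges below threshold
        if np.array_equal(pruned, a):
            return a
        a = pruned
```

On the five-vertex example of Figure 5.11 with k = 3, the entries equal to 7 < 9 in Table 5.11 cause the two pendant edges to be pruned, leaving the triangle.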
Using this method on the instance with 60 hairs from Subsection 5.1.2, only the edges on the bridge between the two target cliques are pruned.

5.5 Graph structures with no useful information for sampling near-uniformly

In this section, we discuss some graph structures that do not provide any useful information for sampling their target cliques near-uniformly. The graphs that we consider are constructed by merging two graphs of the same size that also have target cliques of the same size. If we know that the number of edges, the number of vertices, the size of the target cliques, and the number of target cliques in each of the two merged graphs are the same, there could be some mechanisms to distinguish between the discovered target cliques of the two graphs. For example, we could define an attribute for each vertex to identify its affiliation with one of the two graphs; this way it might be possible to perform an online judgment to determine the amount of the sampler's resources that should be assigned to each of the two subgraphs. However, the sampling procedures that we used do not seem able to take advantage of such information, assuming that is possible at all. Merely knowing that these two subgraphs are disguised in the graph G does not provide any useful information for the samplers to sample the target cliques of G more uniformly.

There are also some DIMACS MAX-CLIQUE instances that have different sizes but where the smaller graph has more target cliques. For example, brock200-2 has 200 vertices and 9 876 edges, while the instance p_hat700-1 has 700 vertices and 60 999 edges. We are looking for cliques of size 11 in both of these graphs. There exist 14 target cliques in the graph brock200-2, while there are only 2 target cliques in the larger graph p_hat700-1. This suggests that even if the size of one subgraph is bigger than the other, it may not help us to bias a sampler towards sampling the target cliques of that subgraph.
It may, however, be possible to perform an online judgment to determine the amount of the sampler's resources that should be assigned to each of the two subgraphs.

A variant of the above case arises when we specifically use DLS-MC [2] as the solver. Let us again assume that our graph G consists of two subgraphs; now the question is whether we can make an educated guess about the best value of pd to sample the target cliques of G near-uniformly. Our assumption is that DLS-MC samples the target cliques of the first subgraph uniformly with a pd value of p1, and the target cliques of the other subgraph with a pd value of p2. Three cases can occur. If both p1 and p2 are 1, then it seems reasonable to use the value 1 for the pd parameter of DLS-MC when sampling the target cliques of G. If one of the two pd values for the subgraphs is 1 and the other is > 1, it is not clear what the best value of pd would be in order for DLS-MC to sample the solutions of G near-uniformly. If both p1 and p2 are > 1, then their average could be a good candidate for the pd parameter of DLS-MC to sample the target cliques of G near-uniformly. The empirical investigation of the ideas presented in this section may be studied in future work.

[Figure 5.11: A graph with a target clique of size 3. The graph has five vertices; its adjacency matrix is given in Table 5.10.]

[Figure 5.12: The graph of Figure 5.11 after two of its edges are pruned using the method proposed in Section 5.4.]

    1 1 1 1 1
    1 1 1 0 0
    1 1 1 0 0
    1 0 0 1 0
    1 0 0 0 1

Table 5.10: The adjacency matrix of the graph of Figure 5.11 added to the identity matrix.

    15 11 11  7  7
    11  9  9  4  4
    11  9  9  4  4
     7  4  4  4  3
     7  4  4  3  4

Table 5.11: The adjacency matrix of the graph of Figure 5.11 raised to the power 3 (i.e., the target clique size).

5.6 Conclusion

In this chapter we introduced two methods of sampling near-uniformly using DLS-MC. The first method was to adjust the value of the pd parameter of DLS-MC. Pullan and Hoos showed that some values of pd minimize the time required to find a target clique [2]. Here we empirically showed that different graph structures require different values of pd to increase the uniformity of the distribution of their sampled target cliques. A pd value equal to 1 may be best for graphs in which the sum of the sampling relative frequencies of each group of cliques around the target cliques is almost equal. For other types of graphs it may be best to try pd values > 1 if possible. If a set of k-CLIQUE graphs is available with information about their structures, it is possible to determine whether a pd value equal to 1 can be used for sampling their target cliques near-uniformly, without further search for an appropriate value of pd.

The second method of sampling, proposed in Section 5.3, was inspired by SampleSat [31]. It uses Monte-Carlo steps with low temperature to sample target cliques in the same neighborhood as a target clique sampled by DLS-MC. A neighboring clique of a clique c is defined in this thesis to be a clique that is connected to at least 10% of the vertices of the clique c. In Tables 5.4 and 5.5 each of the 12 instances is sampled uniformly using one of these two methods. Both of these sampling methods provide improvements over the sampling results of Table 4.1. The instances used for these experiments are the 12 DIMACS MAX-CLIQUE instances whose target cliques DLS-MC, with pd values taken from [2], did not sample uniformly. The method that integrates a Monte-Carlo procedure with DLS-MC introduced in Section 5.3 was only able to sample the instance brock400_3 of the brock family more uniformly than DLS-MC with the original pd values. The improvement in the uniformity of the distribution of solutions of the instance gen400_p0.9_65 was also not significant using this method.
The other sampling method, proposed in Section 5.2, which adjusts the value of pd, did sample the brock instances and gen400_p0.9_65 more uniformly than DLS-MC with the original pd values with respect to some of the measures of uniformity. The instances c-fat200-2, gen400_p0.9_75, san200_0.9_3, sanr200_0.9 and sanr400_0.7 were sampled more uniformly by the method of Section 5.2 (i.e., adjusting the pd values) than by the method that integrates a Monte-Carlo procedure with DLS-MC presented in Section 5.3. Instance sanr200_0.9 was sampled near-uniformly by both of these methods.

While there were improvements in the uniformity of the distribution of the sampled target cliques of the instances, the insights gained about the relation between the types of graphs and the proper method of sampling may be more important. These insights are described in the following two paragraphs.

We believe a pd value equal to 1 should be used to sample near-uniformly when the sum of the relative frequencies (resulting from an independent sampling using DLS-MC) of all the smaller cliques connected to each target clique is almost equal. Another case arises for graphs in which the sum of the relative frequencies of the cliques around some of the target cliques is greater than the sum of the relative frequencies of the cliques around the other target cliques. DLS-MC with a pd value > 1 may be able to penalize the frequently sampled target cliques of these types of graphs and give the less frequently sampled target cliques a chance to be sampled more often. A procedure which embeds two phases (inspired by PLS [3]), one for sampling near-uniformly using a pd value equal to 1 and another for sampling near-uniformly using some randomly chosen high value of pd, may be able to provide us with an efficient algorithm for sampling near-uniformly the target cliques of a graph.
We may also add a third phase that samples near-uniformly using the procedure that integrates a Monte-Carlo procedure with DLS-MC. At the end, one of the sampled solutions would be reported uniformly at random. Investigation of such methods appears to be a promising direction for future work. As explained in Section 5.4, we also examined other possible methods, but they failed to provide any significant improvement for the instances used in this thesis.

Chapter 6: Conclusions and future work

The problem of k-CLIQUE has many applications, such as information retrieval, experimental design, signal transmission, computer vision and bioinformatics [6-8]. Stochastic local search is one of the most effective approaches currently known for solving k-CLIQUE problems [2, 18]. The goal of current state-of-the-art stochastic local search algorithms for the k-CLIQUE problem is to discover a target clique as quickly as possible, but some applications require not just a single target clique but rather a variety of them. Examples of such applications were described in Chapter 1. The goal of this thesis was to study various methods for uniform sampling of the target cliques of graphs, and to suggest various ways to improve their degree of uniformity. We were able to reach this goal to some degree but, more importantly, we gained some insights into various properties of the algorithms, which helped to sample the target cliques of some graphs more uniformly. We were not looking for a perfectly uniform sampler, but rather a near-uniform sampler as a tool for quickly finding many of the target cliques of a given graph. A natural approach for sampling the target cliques of a graph was to use the already available state-of-the-art heuristic MAX-CLIQUE algorithms, which can solve some very hard instances [2]. A priori, there is no reason to expect these algorithms to sample the target cliques of a given graph completely uniformly.
Therefore, we used the following measurements to assess their degree of uniformity: Shannon's entropy, normalized entropy, INT10 and EMPN/E(n). Each of these measurements captures some aspect of the induced distribution. Shannon's entropy represents the amount of uncertainty in sampling each target clique; normalized entropy captures how far we are from the ideal uniform distribution without knowing the total number of target cliques; INT10 shows the fraction of sampling probabilities that fall within at most 1/10 of each other; and EMPN/E(n) is the empirical number of samples required to sample each target clique at least once, divided by the theoretically expected number of samples required to find each target clique at least once.

Before using any heuristic MAX-CLIQUE solver, we tried to take advantage of the already available SAT samplers by translating a given k-CLIQUE instance to a SAT instance. After sampling the satisfying assignments of the encoded SAT instance using the available SAT samplers, the sampled satisfying assignments are translated back to target cliques. As shown in Chapter 3, the majority of the instances encoded in this way are huge. The SAT samplers that we used are XORSample and XORSample′, as theoretical guarantees on their sampling performance were provided by Gomes et al. For one of the two instances that were small enough, both samplers sampled its satisfying assignments very uniformly, while for the other instance only XORSample sampled its satisfying assignments very uniformly with respect to all 3 measures of uniformity. The samplers were not able to sample the satisfying assignments of the encoded instances as efficiently as the native k-CLIQUE algorithm, DLS-MC (see Chapter 4). An additional difficulty with XORSample was the hard task of adjusting its parameters.
The limitation of XORSample′ is that there are not many exact counters [73], such as Relsat [23], available. The idea of generalizing XORSample to deal directly with k-CLIQUE graphs was appealing. In order to implement this idea, XORSample-style constraints must be integrated with graphs, such that the resulting constrained graphs can be used as inputs to k-CLIQUE algorithms; this approach limits the discovery of some of the target cliques of the given graphs. We introduced one type of such constraints, which is described here. The XOR-type constraint that we proposed is a set S containing a random subset of the vertices of the graph. Each time a clique of size less than or equal to k is constructed, a new constraint S of this type is also constructed. Any clique of size less than k cannot be extended to a larger clique if an odd (or even) number of its vertices does not belong to the current constraint S; the sampler randomly decides, with probability 0.5, whether an odd or an even number of vertices should belong to the set S. This type of constraint eliminates target cliques of the graphs in the same way that the XOR clauses eliminated satisfying assignments of the Boolean formulas; however, these constraints do not become part of the graph in a way that impedes the k-CLIQUE solvers from visiting some neighborhoods during their search, as was the case for the XOR clauses used in XORSample. The constraint that we proposed is basically a filter applied at the very end of the search to reject some of the solutions (rejection sampling). This idea can be implemented in order to investigate its sampling performance empirically. The possible shortcoming of this approach is that the constraints do not reduce the number of target cliques while the sampler is performing its search.
One may try to discover an XOR type constraint that becomes part of graphs to eliminate some of the target cliques of a given graph, as the XOR clauses were eliminating some of the satisfying assignments of a given formula. We alternatively used native heuristic MAX-CLIQUE algorithms to overcome the limitations of this generic approach. We chose two of the promising state-of-the-art heuristic MAX-CLIQUE algorithms for sampling, namely, DLS-MC and PLS. DLS-MC is based on a combination of constructive and perturbative local search strategies, and makes use of penalty values associated with the vertices of graphs. PLS as it was already stated has three phases. Each of its phases are designed to find target cliques with different characteristics. For a widely-used set of benchmark instances near-uniform sampling, with respect to all measures of uniformity, can be achieved when these algorithms with their default possible parameters are used. PLS is easier to use as it requires no adjustment of parameters, but its sampling performance is not as good as DLS-MC. DLS-MC has a parameter that, as we have shown in Chapter 5, can have a significant impact on sampling performance of DLS-MC. We believe the success of DLS-MC and PLS to sample near-uniformly the target cliques of many of the DIMACS MAX-CLIQUE instances is due to the ways they construct target cliques. The way vertices are selected by DLS-MC provides a mechanism to improve the amount of uniformity. The selection of vertices during a search by DLS-MC is solely based on vertex penalties that are adjusted dynamically during the search. The other mechanisms built in DLS-MC that may have impact on sampling near-uniformly are the perturbation of the current clique to overcome search stagnation, and the plateau steps that it takes during a search. DLS-MC performs perturbations in two different ways, which were explained in Section 4.1. 
During a plateau search, vertices of the current clique are swapped with vertices not included in the current clique to form a new clique of the same size. PLS uses three different methods of selecting vertices in its three phases, each designed to find target cliques with different characteristics: one phase is for finding target cliques in random graphs, the second is for finding target cliques containing low-degree vertices, and the last is for finding target cliques containing high-degree vertices.

Some of the DIMACS MAX-CLIQUE instances used in this thesis were not sampled near-uniformly using DLS-MC with the original pd values taken from [2] (pd is a parameter of DLS-MC that determines how often the penalty values should be decreased). These values of pd were adjusted by Pullan and Hoos to minimize the time required to report a target clique. We showed that some of these instances can be sampled more uniformly using a pd value other than the ones used in [2]. For graphs in which the sum of the relative frequencies (as determined by an independent sampling using DLS-MC) of all the smaller cliques connected to each target clique is almost equal, a pd value equal to 1 should be used to sample near-uniformly. Another type consists of graphs for which the sum of the relative frequencies of the cliques around some of the target cliques is greater than the sum of the relative frequencies of the cliques around the other target cliques. DLS-MC with a pd value > 1 may be able to penalize the frequently sampled target cliques of these types of graphs and give the less frequently sampled target cliques a chance to be sampled more often.
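One way this two-regime behaviour of pd could be exploited is a phase-alternating wrapper. The following is a hypothetical sketch, in which `dls_mc_sample(graph, k, pd)` is an assumed interface returning one target clique per run, not an implemented algorithm from this thesis:

```python
import random

def phase_sampler(graph, k, runs, dls_mc_sample, high_pd_values=(2, 5, 10)):
    """Hypothetical wrapper alternating between pd = 1 and pd > 1.

    Even-numbered runs use pd = 1 (suited to graphs whose clique groups
    are sampled with roughly equal frequency); odd-numbered runs use a
    randomly chosen pd > 1 (which penalizes frequently revisited
    cliques).  One sampled solution is reported uniformly at random.
    """
    samples = []
    for i in range(runs):
        pd = 1 if i % 2 == 0 else random.choice(high_pd_values)
        samples.append(frozenset(dls_mc_sample(graph, k, pd)))
    return random.choice(samples)
```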
As we mentioned in Chapter 5, these insights suggest that a procedure embedding two phases (inspired by PLS [3]), one sampling near-uniformly with a pd value of 1 and another sampling near-uniformly with some randomly chosen pd value > 1, may be able to sample the target cliques of a graph near-uniformly and efficiently. We could also add a third phase that samples near-uniformly using our method that integrates a Monte-Carlo procedure with DLS-MC. At the end, one of the sampled solutions would be reported uniformly at random. If a graph has groups of target cliques that overlap with each other, a method inspired by SampleSat [31] can be used to sample near-uniformly the target cliques within each group of connected target cliques. This sampling approach works as follows: first, DLS-MC is used to sample one target clique from one of the groups, and then a Monte-Carlo procedure is used to sample the target cliques overlapping with the clique sampled by DLS-MC. This method also improves uniformity on some of the instances. There were other sampling methods, stated in the conclusion section of Chapter 5, that we tried but that failed to provide any improvement on the instances used in this study. One was to preserve the vertex penalty values from previous runs in order to reduce the chance of reconstructing some of the old cliques. The limitation of this method was that some target cliques of some instances would not be discovered at all, possibly because some of their vertices would retain high penalty values for a long time. Another approach to improving the uniformity of the probability distributions of the sampled target cliques was a pre-processing step (described in Chapter 5) to reduce the size of the given graphs.
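A minimal sketch of the Monte-Carlo exploration step, assuming a simple random walk over size-preserving swap moves starting from a clique reported by DLS-MC; the move set, step budget, and function name are our own illustrative choices, not the exact procedure of Chapter 5.

```python
import random

def mc_neighbors(graph, clique, steps=100, rng=random):
    """Monte-Carlo exploration of the target cliques overlapping a seed
    clique (e.g. one reported by DLS-MC): repeatedly pick a random
    size-preserving swap (one vertex out, one vertex in) and record
    every clique visited.  graph maps each vertex to its set of
    neighbours; returns the set of visited cliques as frozensets."""
    seen = {frozenset(clique)}
    current = set(clique)
    for _ in range(steps):
        # enumerate all (u out, v in) swaps that preserve clique size
        swaps = []
        for v in graph:
            if v in current:
                continue
            missing = [u for u in current if u not in graph[v]]
            if len(missing) == 1:
                swaps.append((missing[0], v))
        if not swaps:
            break
        u, v = rng.choice(swaps)
        current = (current - {u}) | {v}
        seen.add(frozenset(current))
    return seen
```

On a graph made of two triangles sharing an edge, seeding the walk with one triangle discovers the other, since the two target cliques overlap in two vertices.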
Unfortunately, this method did not provide a significant reduction in the size of any of the DIMACS MAX-CLIQUE graphs. Heuristic MAX-CLIQUE algorithms have various mechanisms for constructing a target clique; for example, DLS-MC makes use of penalty values associated with vertices. Perhaps the best approach for improving the uniformity of the sampled solutions is to take advantage of the mechanisms of existing heuristic MAX-CLIQUE algorithms. Any new sampling procedure based on this approach would have a way of sampling near-uniformly that may be local to the context of S-CLIQUE. Even though we obtained very good results on a large set of DIMACS instances, the most popular benchmark for comparing MAX-CLIQUE algorithms, we believe an important next step is to test the successful methods of this thesis (adjusting the value of pd, and the method that integrates a Monte-Carlo procedure with DLS-MC) on more instances, such as the frb benchmark instances that arose from the SAT'04 competition7. It is important to bear in mind that these benchmark instances, like the DIMACS MAX-CLIQUE instances, are designed to test the power of MAX-CLIQUE algorithms to find one target clique, not to test samplers designed to sample the target cliques of these instances uniformly. We use these instances because no benchmark instances exist that are specifically designed to examine the sampling performance of S-CLIQUE solvers. As we were able to sample near-uniformly almost all of the DIMACS MAX-CLIQUE instances used in this thesis using various sampling techniques, we suggest the use of our proposed methods in different applications, such as PPI networks. A possible method of sampling near-uniformly the target cliques of graphs is the parsimonious translation of k-CLIQUE to CSP [36] and sampling the solutions of the encoded instance.
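The CSP model behind such a translation can be illustrated with a naive brute-force enumerator: variables x_1 < x_2 < ... < x_k range over the vertices, and every pair (x_i, x_j) must be joined by an edge; the ordering constraint keeps the translation parsimonious, i.e. one CSP solution per clique. This toy enumerator is our own sketch, not the encoding of [36] or the one used in this thesis.

```python
from itertools import combinations

def kclique_csp_solutions(graph, k):
    """Enumerate the solutions of a naive k-CLIQUE-to-CSP model by
    brute force: each ordered k-tuple of vertices is a candidate
    assignment, and the binary constraints require every pair to be an
    edge.  graph maps each vertex to its set of neighbours.  Purely
    illustrative; exponential in k."""
    return [c for c in combinations(sorted(graph), k)
            if all(v in graph[u] for u, v in combinations(c, 2))]
```

Sampling target cliques would then amount to sampling the solutions of this CSP uniformly, which is exactly what the CSP samplers cited below are designed to do.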
There are already some counting and sampling procedures [28, 74–76] available for CSP. The encoding we used in this thesis translates k-CLIQUE to SAT by first translating it implicitly to CSP. Of course, one may also encode k-CLIQUE as SAT by first translating it to CSP explicitly; many CSP-to-SAT encodings are available [37, 77, 78].

The method proposed in Chapter 5 that integrates a Monte-Carlo procedure with DLS-MC reports more than one solution per run. We were interested in the power of this algorithm to report neighboring target cliques. As a direction for sampling near-uniformly, one may measure the sampling performance of stochastic local search algorithms that produce many target cliques within the same run; such algorithms require different measures of uniformity. In this thesis we found it necessary in many cases to adjust the values of the algorithms' parameters. As we saw in Chapter 5, some of these parameters can be set based on the structure of graphs. As a possible future direction, one may study the use of automated algorithm design and parameter optimisation techniques [79–82] to build better clique samplers. In some applications, it is best to use solvers designed to find almost-complete subgraphs, quasi-cliques; however, the methods proposed for sampling near-uniformly the target cliques of graphs are applicable to some of the algorithms for finding quasi-cliques, as some of these quasi-clique algorithms are extended from heuristic MAX-CLIQUE solvers [14].

7 Available at http://www.nlsde.buaa.edu.cn/~kexu/benchmarks/graph-benchmarks.htm

Bibliography

[1] C. P. Gomes, A. Sabharwal, and B. Selman, "Near-uniform sampling of combinatorial spaces using XOR constraints," in Proceedings of Neural Information Processing Systems (NIPS) (B. Schölkopf, J. C. Platt, and T. Hoffman, eds.), (Cambridge, MA), pp.
481–488, MIT Press, 2006.
[2] W. Pullan and H. H. Hoos, "Dynamic local search for the maximum clique problem," Journal of Artificial Intelligence Research, vol. 25, pp. 159–185, 2006.
[3] W. Pullan, "Phased local search for the maximum clique problem," Journal of Combinatorial Optimization, vol. 12, pp. 303–323, November 2006.
[4] R. M. Karp, "Reducibility among combinatorial problems," in Complexity of Computer Computations (R. E. Miller and J. W. Thatcher, eds.), pp. 85–103, New York and London: Plenum Press, 1972.
[5] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco: W. H. Freeman, 1979.
[6] E. Balas and C. S. Yu, "Finding a maximum clique in an arbitrary graph," SIAM Journal on Computing, vol. 15, no. 4, pp. 1054–1068, 1986.
[7] Y. Ji, X. Xu, and G. D. Stormo, "A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences," Bioinformatics, vol. 20, no. 10, pp. 1603–1611, 2004.
[8] P. A. Pevzner and S.-H. Sze, "Combinatorial approaches to finding subtle signals in DNA sequences," in Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, (La Jolla/San Diego, CA, USA), pp. 269–278, AAAI Press, 2000.
[9] A. Grosso, M. Locatelli, and W. J. Pullan, "Simple ingredients leading to very efficient heuristics for the maximum clique problem," Journal of Heuristics, vol. 14, no. 6, pp. 587–612, 2008.
[10] D. Davidov and A. Rappoport, "Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words," in ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, (Morristown, NJ, USA), pp. 297–304, Association for Computational Linguistics, 2006.
[11] R. Horaud and T. Skordas, "Stereo correspondence through feature grouping and maximal cliques," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 11, pp. 1168–1180, 1989.
[12] F. Hormozdiari, P. Berenbrink, N. Przulj, and S. C. Sahinalp, "Not all scale free networks are born equal: The role of the seed graph in PPI network emulation," in Systems Biology and Computational Proteomics (T. Ideker and V. Bafna, eds.), vol. 4532 of Lecture Notes in Computer Science, pp. 1–13, Springer, 2006.
[13] T. Milenkovic and N. Przulj, "Uncovering biological network function via graphlet degree signatures," Cancer Informatics, vol. 6, pp. 257–273, 2008.
[14] M. Brunato, H. H. Hoos, and R. Battiti, "On effectively finding maximal quasi-cliques in graphs," in Learning and Intelligent Optimization: Second International Conference, LION 2007 II, Trento, Italy, December 8–12, Selected Papers, (Berlin, Heidelberg), pp. 41–55, Springer-Verlag, 2008.
[15] C. Bron and J. Kerbosch, "Algorithm 457: Finding all cliques of an undirected graph," Communications of the ACM, vol. 16, no. 9, pp. 575–577, 1973.
[16] J. Håstad, "Clique is hard to approximate within n^(1-ε)," in FOCS '96: Proceedings of the 37th Annual Symposium on Foundations of Computer Science, (Washington, DC, USA), p. 627, IEEE Computer Society, 1996.
[17] R. Boppana and M. M. Halldórsson, "Approximating maximum independent sets by excluding subgraphs," BIT, vol. 32, no. 2, pp. 180–196, 1992.
[18] I. M. Bomze, M. Budinich, P. M. Pardalos, and M. Pelillo, "The maximum clique problem," in Handbook of Combinatorial Optimization (D.-Z. Du and P. M. Pardalos, eds.), vol. 1, (Dordrecht, The Netherlands), pp. 1–74, Kluwer Academic Publishers, 1999.
[19] L. G. Valiant and V. V. Vazirani, "NP is as easy as detecting unique solutions," in STOC '85: Proceedings of the Seventeenth Annual ACM Symposium on Theory of Computing, (New York, NY, USA), pp. 458–463, ACM, 1985.
[20] C. E. Shannon, "A mathematical theory of communication," SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
[21] P. Flajolet, D. Gardy, and L. Thimonier, "Birthday paradox, coupon collectors, caching algorithms and self-organizing search," Discrete Applied Mathematics, vol. 39, no. 3, pp. 207–229, 1992.
[22] C. H. Papadimitriou, Computational Complexity. Reading, Massachusetts: Addison Wesley, November 1993.
[23] R. J. Bayardo Jr. and J. D. Pehoushek, "Counting models using connected components," in Proceedings of AAAI/IAAI, (Austin, Texas, USA), pp. 157–162, AAAI Press / The MIT Press, 2000.
[24] T. Sang, F. Bacchus, P. Beame, H. A. Kautz, and T. Pitassi, "Combining component caching and clause learning for effective model counting," in Theory and Applications of Satisfiability Testing, SAT 2004, 7th International Conference, Vancouver, BC, Canada, May 10–13, Proceedings, 2004.
[25] C. P. Gomes, A. Sabharwal, and B. Selman, "Model counting: A new strategy for obtaining good bounds," in Proceedings of AAAI, AAAI Press, 2006.
[26] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani, "Random generation of combinatorial structures from a uniform distribution," Theoretical Computer Science, vol. 43, no. 2–3, pp. 169–188, 1986.
[27] R. Dechter, K. Kask, E. Bin, and R. Emek, "Generating random solutions for constraint satisfaction problems," in Eighteenth National Conference on Artificial Intelligence, (Menlo Park, CA, USA), pp. 15–21, American Association for Artificial Intelligence, 2002.
[28] V. Gogate and R. Dechter, "A new algorithm for sampling CSP solutions uniformly at random," in Principles and Practice of Constraint Programming (CP) (F. Benhamou, ed.), vol. 4204 of Lecture Notes in Computer Science, pp. 711–715, Springer, 2006.
[29] V. Gogate and R. Dechter, "Approximate solution sampling (and counting) on AND/OR spaces," in Principles and Practice of Constraint Programming (CP) (P. J. Stuckey, ed.), vol. 5202 of Lecture Notes in Computer Science, pp. 534–538, Springer, 2008.
[30] D. Larkin, "Generating random solutions from a constraint satisfaction problem with controlled probability," in Proceedings of the 1st International Workshop on Constraints in Functional Verification, Principles and Practice of Constraint Programming, 2002.
[31] W. Wei, J. Erenrich, and B. Selman, "Towards efficient sampling: Exploiting random walk strategies," in Proceedings of AAAI (D. L. McGuinness and G. Ferguson, eds.), (San Jose, California, USA), pp. 670–676, AAAI Press / The MIT Press, 2004.
[32] J. Argelich, A. Cabiscol, I. Lynce, and F. Manyà, "Encoding MAX-CSP into partial MAX-SAT," in ISMVL '08: Proceedings of the 38th International Symposium on Multiple-Valued Logic, (Washington, DC, USA), pp. 106–111, IEEE Computer Society, 2008.
[33] A. M. Frisch, T. J. Peugniez, A. J. Doggett, and P. W. Nightingale, "Solving non-Boolean satisfiability problems with stochastic local search: A comparison of encodings," Journal of Automated Reasoning, vol. 35, no. 1–3, pp. 143–179, 2005.
[34] I. P. Gent, P. Nightingale, and A. G. D. Rowley, "Encoding quantified CSPs as quantified Boolean formulae," in Proceedings of the 16th European Conference on Artificial Intelligence (ECAI) (R. L. de Mántaras and L. Saitta, eds.), pp. 176–180, IOS Press, 2004.
[35] S. D. Prestwich, "Local search on SAT-encoded colouring problems," in Theory and Applications of Satisfiability Testing, SAT 2003 (E. Giunchiglia and A. Tacchella, eds.), vol. 2919 of Lecture Notes in Computer Science, pp. 105–119, Springer, 2003.
[36] J.-C. Régin, "Using constraint programming to solve the maximum clique problem," in Principles and Practice of Constraint Programming (CP) (F. Rossi, ed.), vol. 2833 of Lecture Notes in Computer Science, (Kinsale, Ireland), pp. 634–648, Springer, 2003.
[37] T. Walsh, "SAT vs CSP," in Principles and Practice of Constraint Programming, CP 2000, 6th International Conference, Singapore, September 18–21, 2000, Proceedings (R. Dechter, ed.), vol. 1894 of Lecture Notes in Computer Science, pp. 441–456, Springer, 2000.
[38] R. Battiti and M. Protasi, "Reactive local search for the maximum clique problem," Algorithmica, vol. 29, no. 4, pp. 610–637, 2001.
[39] P. Hansen, N. Mladenović, and D. Urošević, "Variable neighborhood search for the maximum clique," Discrete Applied Mathematics, vol. 145, no. 1, pp. 117–125, 2004.
[40] E. Tomita and T. Kameda, "An efficient branch-and-bound algorithm for finding a maximum clique with computational experiments," Journal of Global Optimization, vol. 37, no. 1, pp. 95–111, 2007.
[41] E. Tomita, A. Tanaka, and H. Takahashi, "The worst-case time complexity for generating all maximal cliques and computational experiments," Theoretical Computer Science, vol. 363, no. 1, pp. 28–42, 2006.
[42] S. R. Bulò, A. Torsello, and M. Pelillo, "A continuous-based approach for partial clique enumeration," in Graph-Based Representations in Pattern Recognition, 6th IAPR-TC-15 International Workshop, GbRPR 2007, Alicante, Spain, June 11–13, Proceedings (F. Escolano and M. Vento, eds.), vol. 4538 of Lecture Notes in Computer Science, pp. 61–70, Springer, 2007.
[43] F. Heras and J. Larrosa, "A MAX-SAT inference-based pre-processing for max-clique," in Theory and Applications of Satisfiability Testing, SAT 2008, 11th International Conference, Guangzhou, China, May 12–15, Proceedings (H. K. Büning and X. Zhao, eds.), vol. 4996 of Lecture Notes in Computer Science, pp. 139–152, Springer, 2008.
[44] A. Bertoni, P. Campadelli, and G. Grossi, "A neural algorithm for the maximum clique problem: Analysis, experiments, and circuit implementation," Algorithmica, vol. 33, no. 1, pp. 71–88, 2002.
[45] G. Nemhauser and G. Sigimondi, "Properties of vertex packings and independence system polyhedra," Mathematical Programming, vol. 6, no. 1, pp. 48–61, 1974.
[46] M. Pelillo and A. Torsello, "Payoff-monotonic game dynamics and the maximum clique problem," Neural Computation, vol. 18, no. 5, pp. 1215–1258, 2006.
[47] L. E. Gibbons, D. W. Hearn, P. M. Pardalos, and M. V. Ramana, "Continuous characterizations of the maximum clique problem," Mathematics of Operations Research, vol. 22, no. 3, pp. 754–768, 1997.
[48] S. Busygin, "A new trust region technique for the maximum weight clique problem," Discrete Applied Mathematics, vol. 154, no. 15, pp. 2080–2096, 2006.
[49] I. S. Dhillon, A new O(N^2) algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem. PhD thesis, University of California at Berkeley, Berkeley, CA, USA, 1998.
[50] I. M. Bomze, M. Budinich, M. Pelillo, and C. Rossi, "Annealed replication: A new heuristic for the maximum clique problem," Discrete Applied Mathematics, vol. 121, no. 1–3, pp. 27–49, 2002.
[51] M. Pelillo, "Annealed imitation: Fast dynamics for maximum clique," in Proceedings of the International Joint Conference on Neural Networks, 2003, vol. 1, pp. 55–60, IEEE, 2003.
[52] T. Fahle, "Simple and fast: Improving a branch-and-bound algorithm for maximum clique," in ESA '02: Proceedings of the 10th Annual European Symposium on Algorithms, (London, UK), pp. 485–498, Springer-Verlag, 2002.
[53] P. J. Taillon, "A new approach for solving the maximum clique problem," in Algorithmic Aspects in Information and Management, Second International Conference, AAIM 2006, Hong Kong, China, June 20–22, Proceedings (S.-W. Cheng and C. K. Poon, eds.), vol. 4041 of Lecture Notes in Computer Science, pp. 279–290, Springer, 2006.
[54] K. Katayama, M. Sadamatsu, and H. Narihisa, "Iterated k-opt local search for the maximum clique problem," in Evolutionary Computation in Combinatorial Optimization, 7th European Conference, EvoCOP 2007, Valencia, Spain, April 11–13, Proceedings (C. Cotta and J. I. van Hemert, eds.), vol. 4446 of Lecture Notes in Computer Science, pp. 84–95, Springer, 2007.
[55] A. Singh and A. K. Gupta, "A hybrid heuristic for the maximum clique problem," Journal of Heuristics, vol. 12, no. 1–2, pp. 5–22, 2006.
[56] K. Katayama, A. Hamamoto, and H. Narihisa, "An effective local search for the maximum clique problem," Information Processing Letters, vol. 95, no. 5, pp. 503–511, 2005.
[57] X. Xu, J. Ma, and J. Lei, "An improved ant colony optimization for the maximum clique problem," in ICNC '07: Proceedings of the Third International Conference on Natural Computation, (Washington, DC, USA), pp. 766–770, IEEE Computer Society, 2007.
[58] Z. Bai and Q. Lv, "A leader-based parallel cross entropy algorithm for MCP," in GECCO '07: Proceedings of the 2007 GECCO Conference Companion on Genetic and Evolutionary Computation, (New York, NY, USA), pp. 2401–2406, ACM, 2007.
[59] B. Selman, H. Kautz, and B. Cohen, "Local search strategies for satisfiability testing," in DIMACS Series in Discrete Mathematics and Theoretical Computer Science, (Providence, RI), pp. 521–532, American Mathematical Society, 1996.
[60] H. H. Hoos and T. Stützle, Stochastic Local Search: Foundations and Applications. The Morgan Kaufmann Series in Artificial Intelligence, San Mateo: Morgan Kaufmann, 2004.
[61] M. W. Moskewicz, C. F. Madigan, Y. Zhao, L. Zhang, and S. Malik, "Chaff: Engineering an efficient SAT solver," in DAC '01: Proceedings of the 38th Annual Design Automation Conference, (New York, NY, USA), pp. 530–535, ACM, 2001.
[62] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, September 1988.
[63] R. Dechter, K. Kask, and R. Mateescu, "Iterative join-graph propagation," in UAI '02: Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence, University of Alberta, August 1–4 (A. Darwiche and N. Friedman, eds.), pp. 128–136, Morgan Kaufmann, 2002.
[64] V. Gogate and R. Dechter, "Studies in solution sampling," in Proceedings of AAAI (D. Fox and C. P. Gomes, eds.), pp. 271–276, AAAI Press, 2008.
[65] K. Iwama and S. Miyazaki, "SAT-variable complexity of hard combinatorial problems," in Proceedings of the IFIP World Computer Congress, vol. 1, pp. 253–258, Elsevier Science B.V., 1994.
[66] S. Kugele, "Efficient solving of combinatorial problems using SAT solvers," Master's thesis, Fakultät für Informatik, Technische Universität München, October 2006.
[67] H. H. Hoos, "SAT-encodings, search space structure, and local search performance," in IJCAI '99: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, (San Francisco, CA, USA), pp. 296–303, Morgan Kaufmann Publishers Inc., 1999.
[68] D. A. Clark, J. Frank, I. P. Gent, and E. MacIntyre, "Local search and the number of solutions," Lecture Notes in Computer Science, vol. 1118, pp. 119–133, 1996.
[69] J. Frank, P. Cheeseman, and J. Stutz, "When gravity fails: Local search topology," Computing Research Repository (CoRR), vol. cs.AI/9712101, 1997.
[70] S. D. Prestwich, "Negative effects of modeling techniques on search performance," Annals of Operations Research, vol. 118, no. 1–4, pp. 137–150, 2003.
[71] N. Eén and N. Sörensson, "An extensible SAT-solver," pp. 502–518, 2004.
[72] C. P. Gomes, J. Hoffmann, A. Sabharwal, and B. Selman, "Short XORs for model counting: From theory to practice," in Theory and Applications of Satisfiability Testing, SAT 2007, 10th International Conference, Lisbon, Portugal, May 28–31, Proceedings (J. Marques-Silva and K. A. Sakallah, eds.), vol. 4501 of Lecture Notes in Computer Science, pp. 100–106, Springer, 2007.
[73] S. Khurshid, D. Marinov, I. Shlyakhter, and D. Jackson, "A case for efficient solution enumeration," in Proceedings of the Sixth International Conference on Theory and Applications of Satisfiability Testing (SAT 2003), Santa Margherita Ligure, 2003.
[74] O. Angelsmark and P. Jonsson, "Improved algorithms for counting solutions in constraint satisfaction problems," in Principles and Practice of Constraint Programming, CP 2003 (F. Rossi, ed.), vol. 2833 of Lecture Notes in Computer Science, (Kinsale, Ireland), pp. 81–95, Springer, 2003.
[75] C. P. Gomes, W.-J. van Hoeve, A. Sabharwal, and B. Selman, "Counting CSP solutions using generalized XOR constraints," in Proceedings of AAAI, (Vancouver, BC), pp. 204–209, July 2007.
[76] K. Kask, R. Dechter, and V. Gogate, "Counting-based look-ahead schemes for constraint satisfaction," in Principles and Practice of Constraint Programming, CP 2004, 10th International Conference, September 27 – October 1, Proceedings (M. Wallace, ed.), vol. 3258 of Lecture Notes in Computer Science, pp. 317–331, Springer, 2004.
[77] N. Tamura and M. Banbara, "Sugar: A CSP to SAT translator based on order encoding," in Proceedings of the Second International CSP Solver Competition, pp. 65–69, 2008.
[78] N. Tamura, A. Taga, S. Kitagawa, and M. Banbara, "Compiling finite linear CSP into SAT," in Principles and Practice of Constraint Programming, CP 2006, 12th International Conference, Nantes, France, September 25–29, Proceedings (F. Benhamou, ed.), vol. 4204 of Lecture Notes in Computer Science, pp. 590–603, Springer, 2006.
[79] H. H. Hoos, "Computer-aided design of high-performance algorithms," Tech. Rep. TR-2008-16, Department of Computer Science, University of British Columbia, December 2008.
[80] F. Hutter, H. H. Hoos, K. Leyton-Brown, and K. P. Murphy, "An experimental investigation of model-based parameter optimisation: SPO and beyond," in Proceedings of GECCO-09, 2009. To appear.
[81] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle, "ParamILS: An automatic algorithm configuration framework," Tech. Rep. TR-2009-01, Department of Computer Science, University of British Columbia, January 2009.
[82] F. Hutter, H. H. Hoos, and T. Stützle, "Automatic algorithm configuration based on local search," in Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI '07), pp. 1152–1157, 2007.

Appendix

No.  Instance         Nodes  Edges    Target size  Number of target cliques
1    brock200_1       200    14834    21           2
2    brock200_2       200    9876     11           14
3    brock200_3       200    12048    14           24
4    brock200_4       200    13089    16           30
5    brock400_3       400    59681    30           31
6    brock400_4       400    59765    32           33
7    c-fat200-1       200    1534     12           14
8    c-fat200-2       200    3235     23           26
9    c-fat200-5       200    8473     58           3
10   c-fat500-1       500    4459     14           19
11   c-fat500-2       500    9139     26           19
12   c-fat500-5       500    23191    64           3
13   c-fat500-10      500    46627    126          3
14   gen200_p0.9_44   200    17910    44           4
15   gen200_p0.9_55   200    17910    55           4
16   gen400_p0.9_65   400    71820    64           71
17   gen400_p0.9_75   400    71820    74           77
18   hamming6-2       64     1824     32           2
19   hamming6-4       64     704      4            240
20   hamming8-4       256    20864    16           480
21   hamming10-2      1024   518656   512          2
22   johnson8-2-4     28     210      4            105
23   johnson8-4-4     70     1855     14           30
24   p_hat1000-1      1000   122253   10           276
25   p_hat1000-2      1000   244799   46           491
26   p_hat300-1       300    10933    8            13
27   p_hat300-2       300    21928    25           52
28   p_hat300-3       300    33390    36           14
29   p_hat500-1       500    31569    9            78
30   p_hat500-2       500    62946    36           14
31   p_hat500-3       500    93800    50           62
32   p_hat700-1       700    60999    11           2
33   p_hat700-2       700    121728   44           138
34   san200_0.7_1     200    13930    29           30
35   san200_0.7_2     200    13930    18           2
36   san200_0.9_1     200    17910    69           70
37   san200_0.9_2     200    17910    59           61
38   san200_0.9_3     200    17910    43           50
39   san400_0.9_1     400    71820    99           100
40   sanr200_0.7      200    17863    18           13
41   sanr200_0.9      200    17863    42           21
42   sanr400_0.5      400    39984    13           4
43   sanr400_0.7      400    55869    21           140
44   DSJC500.5        500    125248   13           56

Table 1: The number of vertices, edges, and target cliques, and the target clique size, of each DIMACS MAX-CLIQUE instance used in this thesis.