International Construction Specialty Conference of the Canadian Society for Civil Engineering (ICSC) (5th : 2015)

Topic modeling for infrastructure-related discussions in online social media Nik-Bakht, M.; El-Diraby, T. E. Jun 30, 2015


5th International/11th Construction Specialty Conference
Vancouver, British Columbia, June 8 to June 10, 2015

TOPIC MODELING FOR INFRASTRUCTURE-RELATED DISCUSSIONS IN ONLINE SOCIAL MEDIA

M. Nik-Bakht1,2, T. E. El-Diraby1
1 Center for Civil Informatics, Department of Civil Engineering, University of Toronto, Canada
2 mazdak.nikbakht@mail.utioronto.ca

Abstract: Decision making for the construction of modern civil infrastructure not only involves internal stakeholders, but also aims to include the interests of as many external stakeholders as possible. In mega-projects, the complexity and diversity of stakeholders call for more advanced communication tools and channels. The spread of the social web as a two-way communication channel during the last decade has caused a paradigm shift in communication among the e-society, and this has attracted the attention of decision makers in urban infrastructure, among other domains. Although the social web has a wide public outreach, the open and unstructured nature of inputs from the e-society results in chaos and makes it difficult to distil knowledge from the contents communicated by the public. This paper presents tools from topic modeling to process such unstructured data, collected from online social media, into information which can be plugged into the process of decision making. We use k-means clustering to cluster followers of an infrastructure project on the micro-blogging website Twitter based on semantic similarity among their user profile descriptions. This helps in profiling the main groups of followers of the infrastructure project and can provide decision makers with valuable hints regarding the typology of external stakeholders. We also extend our analysis to project-related tweets through Latent Semantic Indexing, and find the main topics discussed.
The latter helps decision makers understand the public’s major vested interests in the project. We have applied the proposed method to a Light Rail Transit (LRT) mega-project in Toronto, Ontario and have discussed the results.

1 INTRODUCTION – INFRASTRUCTURE DISCUSSION NETWORKS (IDN)

Decision making for the construction and development of civil infrastructure in modern society is a networked procedure. This stems not only from the network of interdependent sub-decisions, the random variables involved, and the governing criteria, but also from the required interactions among a multitude of decision makers (internal stakeholders), each having their own beliefs, goals, and interests in the project. In addition, the external stakeholders (those affected by the decisions) and their vested interests must be considered in decision making, which adds even more to the complexity of the decision network. The final decision in such a networked scenario will be in the form of a ‘package-deal’ (Bruijn and Heuvelhof 2000): a sub-optimal solution which accommodates the major interests of different internal and external stakeholders of the project.

One major goal of community engagement practices in the domain of urban infrastructure is to detect the main groups of external stakeholders along with their interests and to involve them in the network of decision making. Public engagement in infrastructure projects has traditionally been carried out through off-line tools such as community meetings, public hearings, and questionnaires, or through online tools such as Web 1.0 portals and project websites, which are considered one-way communication channels to educate the society about the project and its related decisions. On the other hand, as a social-network-related issue, this procedure is influenced, if not re-shaped, by the wind of Web 2.0 (social web).
Many public engagement agencies have recently started trying tools such as weblogs and micro-blogging websites (such as Blogger and Twitter), multimedia sharing websites (including YouTube and Instagram), and other tools for bi-directional communication with the public about infrastructure projects (Bregman and Watkins 2013; Azhar and Abeln 2014). The social web is not only a portal used to comment on design and planning; rather, it is a platform for bringing community members together along with their ideas. Web 2.0–based engagement will hence create a ‘network of people’ together with a ‘network of ideas’. Such networks formed around infrastructure projects are called Infrastructure Discussion Networks – IDNs (Nik Bakht and El-diraby, 2013a&b & 2014).

The formation and evolution of these two interdependent networks (the network of people and the network of ideas) around the physical network of the infrastructure should be closely and precisely monitored, and the different trends in them must be followed, in order to make the best of users’ ideas and innovations in the design and construction of a sustainable infrastructure. This requires employing tools and methods from network science, information retrieval, and computational linguistics to link people to their ideas and understand their patterns of connectivity over IDNs. One important outcome of monitoring such patterns is to detect cores of interest in a project, in conjunction with the people who support them.
Some potential applications of making such a connection between users and ideas over the IDN can be listed as:
• Classifying typology of stakeholders, particularly the end users of the infrastructure system;
• Profiling core interests to be involved in package deal solutions offered for the infrastructure project;
• Detecting relations and interactions among different community groups;
• Finding possible communication bottlenecks in the process of public consultation;
• Demand detection in a more direct and pro-active manner;
• Activating ‘user innovation’ by understanding and ranking community generated inputs.

Achieving such goals requires detection and analysis of interests for nodes and communities of the IDN. Steinhaeuser & Chawla (2008) suggested assigning interests as node attributes in the social network and then clustering the network based on the similarity of attributes. In a similar study, Kalafatis (2009) harvested Twitter users based on their similar interests through the occurrence of some pre-defined keywords in their Twitter biography. To include the degree of importance for nodes, Nik Bakht & El-Diraby (2013a) proposed community detection through analysis of the network topology, and then extracting shared interests within each community through computational linguistics. While segmenting users through the similarity among their attributes is a ‘top-down’ approach, finding communities and then interpreting similarities among them can be considered a ‘bottom-up’ method. Applications such as topic detection in collaborative tagging systems, reviewed by Papadopoulos et al. (2012), and social trust evaluation for recommender systems, addressed by Pitsilis et al. (2011), are among other examples which combine the two approaches to create tools for profiling and labeling groups of followers.
Although the bottom-up approach is helpful for understanding the social construct of the e-society around the infrastructure project, it does not necessarily reveal cores of interest in the project. In order to detect the major lines of interest to be included in the package-deal solution, the main clusters of ideas discussed by project followers must be detected and highlighted. This paper presents a top-down approach to reach this goal. We start by clustering users and ideas based on their semantic similarity (rather than social connectivity). We use data collected from the micro-blogging website Twitter, the most popular social media platform used in the domain (Bregman, 2012; Azhar & Abeln, 2014). Focusing on an LRT mega-project in Toronto, Ontario, Canada, we use semantic analysis combined with clustering algorithms to detect the main lines of interest in following this project on Twitter. We apply our method to the IDN of the project and cluster its nodes based on the semantic similarity among their profiles. Then we focus on tweets about the project over a certain period of time and use our proposed method to detect the core themes discussed. Finally, we show how the results can represent the public community's vested interests in practice.

2 METHODOLOGY – CLUSTERING BASED ON SEMANTIC SIMILARITY OF USER PROFILES

Users in online social network websites normally express their interests through their online activities, including the contents they share and the comments they leave. However, in many cases they also state their interests directly in user profile descriptions (also known as biography or ‘bio’ for short). Twitter users can indicate their location, professional affiliations, and interests in a short bio in fewer than 160 characters. Such descriptions are accessible through the Twitter API (Application Programming Interface). Also, one can use the Twitter API to retrieve the most recent tweets of a user (see footnote 1).
Such tweets can also indirectly reflect users’ interests. The aim of this paper is to partition the IDN into groups of nodes with similar interests.

We collect descriptions/tweets of all nodes of an IDN as a set of ‘pseudo-documents’, and form an analysis corpus. Data clustering techniques combined with semantic approaches are then employed to cluster the corpus based upon the similarity between those pseudo-documents. The method can therefore be simplified as: clustering nodes of an IDN based on semantic similarity among their descriptions collected from their user profiles. This requires cleaning and pre-processing the collected data, modeling it and evaluating the semantic similarity between each pair of data points, and finally clustering nodes into the most similar groups.

Footnote 1: Currently, up to 3200 of the most recent tweets by a user are accessible.

2.1 Pre-processing and modeling users as vectors

In order to analyze documents, they must first be modeled as data-points (vectors) with specific attributes (dimensions) and attribute values (entries). In text mining and natural language processing (NLP), dimensions of the analysis vector space (called features in NLP and attributes in data mining) are associated with terms of the corpus. Texts collected via the Twitter API are normally not in a uniform encoding format and also include some ‘noise’. Therefore, the first step in modeling documents is to tokenize, unify, and clean up the collected data. These tasks are called pre-processing and include the following steps:

Cleaning:
• Removing html tags and attributes which are not visible in a browser;
• Replacing all html character codes (such as &amp, &quot, etc.) with their ASCII equivalents;
• Removing all URLs (as most of them are associated with advertisements and commercials);
• Removing Twitter specific characters such as hash-tags (#) and mentioning handles (@);
• Replacing monetary values and percentages with a trackable variable (for this purpose we transformed all numbers followed or preceded by a dollar sign to XXX and all numbers followed or preceded by a percentage sign to XX);
• Removing all other numbers (as they are less likely to contribute to semantics).

Tokenizing:
• Splitting texts at white spaces;
• Separating clitics and punctuation from their hosts;
• Saving ellipses and other forms of multiple punctuation as separate tokens.

All these activities were performed using RegEx (regular expressions). Stemming may also be a part of pre-processing; i.e. all words with the same root could be transformed into their roots. However, this was not done in the present paper due to the aggressive nature of most available stemmers and the confusion caused by stemming. Apart from the noise, common words with no specific semantics (such as conjunctions, articles, gibberish, punctuation, etc.) should be filtered out from the corpus. This is normally done by introducing a ‘stop list’ (a complete list of such words). We started with a compilation of some standard stop lists and added new terms to it as the analyses progressed. As the text processing relies on occurrence and co-occurrence of terms in multiple documents, terms which appear in one document only were also removed to reduce the dimensions of the analysis space.

A dictionary was formed by collecting all tokens of the cleaned corpus. Each term in this dictionary would be one dimension of the analysis space. Project followers were then modeled as vectors in such a vector space.
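The cleaning and tokenizing steps listed above can be sketched with Python's standard `re` module. This is a minimal illustration under stated assumptions, not the authors' scripts: the regular expressions, the deliberately tiny `STOP_WORDS` set, and the names `clean`, `tokenize`, and `build_corpus` are all introduced here for illustration, and `html.unescape` yields Unicode characters rather than strictly ASCII equivalents.

```python
import html
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; the paper compiled standard lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is"}
PLACEHOLDERS = {"XXX", "XX"}  # money / percentage markers keep their case

# Ellipses and repeated punctuation first, then clitics, words, single marks.
TOKEN_RE = re.compile(r"\.{2,}|[!?]{2,}|'\w+|\w+|[^\w\s]")


def clean(text):
    """Apply the cleaning steps in the order they are listed in the text."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags and their attributes
    text = html.unescape(text)                  # &amp; -> &, &quot; -> ", ...
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"[#@]", "", text)            # hashtag and mention symbols
    text = re.sub(r"\$\s?\d[\d,.]*|\d[\d,.]*\s?\$", " XXX ", text)  # money
    text = re.sub(r"%\s?\d[\d,.]*|\d[\d,.]*\s?%", " XX ", text)     # percentages
    return re.sub(r"\d+", " ", text)            # all remaining numbers


def tokenize(text):
    """Split into tokens, keeping clitics and punctuation runs separate."""
    return [t if t in PLACEHOLDERS else t.lower() for t in TOKEN_RE.findall(text)]


def build_corpus(raw_texts):
    """Clean and tokenize, then drop stop words, punctuation-only tokens,
    and terms that occur in a single pseudo-document."""
    docs = []
    for raw in raw_texts:
        tokens = tokenize(clean(raw))
        docs.append([t for t in tokens
                     if t not in STOP_WORDS and re.search(r"\w", t)])
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [[t for t in doc if doc_freq[t] > 1] for doc in docs]
```

Calling `build_corpus` on a list of profile descriptions returns one token list per pseudo-document; the set of all remaining tokens then plays the role of the dictionary described above.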
Assuming the dictionary contains M terms in total ($t_1$ to $t_M$), each user X with description d(X), which is called a pseudo-document here and is specified by a set of attributes (terms) $t_i$, was modeled as an M-dimensional vector with non-zero values only in the entries i that correspond to those terms. In the simplest case, $d(X_j)_i$ is 1 if term $t_i$ occurs in the description of node $X_j$, and is 0 otherwise. A collection of pseudo-documents (modeled as column vectors d(X)) for all members of an IDN of size N would then form an M × N matrix, which is equivalent to ‘term-document’ matrices in topic modeling and attribute-relation tables in data analysis. Such a matrix takes the following form:

$$\begin{bmatrix} d(X_1)_1 & d(X_2)_1 & \cdots & d(X_N)_1 \\ d(X_1)_2 & d(X_2)_2 & \cdots & d(X_N)_2 \\ \vdots & \vdots & \ddots & \vdots \\ d(X_1)_M & d(X_2)_M & \cdots & d(X_N)_M \end{bmatrix}$$

More advanced versions of such a matrix can also be considered; they include bag of words, in which entries are non-negative integers reflecting the occurrence frequency of words in pseudo-documents, or using TF-IDF (Term Frequency – Inverse Document Frequency) as attribute values, which is a measure describing how representative a term is for a specific user.

2.2 k-means clustering

By considering each node of the IDN as a data-point and each word of the dictionary as an attribute, the abovementioned term-document matrix collects attribute values for all data-points. It can accordingly be used to formulate the top-down clustering as an unsupervised clustering problem. The goal of such a problem is to partition database D into k classes C such that members in each class are the most similar to each other and less similar to members from other classes. Global optimality (exhaustively enumerating all possible partitions) and heuristic methods are the two main options for performing such a task. Heuristic search can be either agglomerative or divisive; while the former starts with individual nodes and looks for the most similar ones to be collected in the same group, the latter starts at the network level and tries to find the best cut-offs.
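Returning briefly to the term-document matrix of Section 2.1, its three variants (binary, bag of words, TF-IDF) can be sketched in plain Python as below. The function name `term_document_matrix` and the `weighting` parameter are hypothetical, and the TF-IDF formula used (raw term count times log(N / document frequency)) is one common variant assumed here; the paper does not state which variant was applied.

```python
import math


def term_document_matrix(docs, weighting="binary"):
    """Return (dictionary, matrix): one row per term, one column per document.

    docs is a list of token lists (the pseudo-documents).
    weighting is "binary", "count" (bag of words), or "tfidf".
    """
    dictionary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Number of pseudo-documents in which each term occurs.
    doc_freq = {t: sum(1 for doc in docs if t in doc) for t in dictionary}

    matrix = []
    for term in dictionary:
        row = []
        for doc in docs:
            count = doc.count(term)
            if weighting == "binary":
                value = 1.0 if count else 0.0
            elif weighting == "count":
                value = float(count)
            elif weighting == "tfidf":
                # Assumed variant: tf * log(N / df).
                value = count * math.log(n_docs / doc_freq[term])
            else:
                raise ValueError("unknown weighting: " + weighting)
            row.append(value)
        matrix.append(row)
    return dictionary, matrix
```

Each column of the returned matrix is one pseudo-document vector d(X), so the result has M rows (terms) and N columns (nodes), matching the layout above.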
k-means clustering and k-medoids clustering, or PAM (Partitioning Around Medoids), are two of the best known heuristic methods for clustering. Many studies in the literature have successfully applied k-means to cluster documents or to detect dominant themes in them. Table 1 lists some recent works on clustering short-length texts (such as tweets, URLs, and comments).

Clustering needs a measure of similarity before anything else. Different metrics are used in the literature to measure distance (dissimilarity) among data-points. Minkowski distance (with its two well-known variants, Manhattan distance and Euclidean distance), cosine distance (calculated based on the angle between two data-points viewed as vectors), and Jaccard distance are among the basic distance metrics. The distance function must be designed carefully for the context of use, drawing on domain knowledge. For semantic clustering purposes, two main groups of distances are used: first, calculating distance through corpus statistics; and second, calculating the semantic distance of terms through an external ontology or knowledge-base. The latter is calculated based on the distance of two terms in a taxonomy, and the former is the vector distance between pseudo-documents, calculated either as cosine or as Jaccard distance. In the present problem, these can be calculated as follows:

[1a] Cosine distance: $dist_{cos}(X_i, X_j) = 1 - \dfrac{\sum_{m=1}^{M} d(X_i)_m \, d(X_j)_m}{\lVert d(X_i) \rVert \, \lVert d(X_j) \rVert}$

[1b] Jaccard distance: $dist_{J}(X_i, X_j) = 1 - \dfrac{\lvert d(X_i) \cap d(X_j) \rvert}{\lvert d(X_i) \cup d(X_j) \rvert}$

In which: $d(X_i)_m$ is the value of the mth attribute (mth term) for user i in the term-document matrix; $\lVert d(X_i) \rVert$ is the norm of vector $d(X_i)$; $\lvert d(X_i) \cap d(X_j) \rvert$ is the number of terms in common between the profiles of users i and j; and $\lvert d(X_i) \cup d(X_j) \rvert$ is the number of distinct terms in the profiles of users i and j.
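A small sketch of the two corpus-based distances, written for term vectors taken as columns of the term-document matrix. The function names and the convention of returning 1.0 for an empty profile are assumptions made for this illustration (the paper instead sets nodes with blank descriptions aside in a separate cluster).

```python
import math


def cosine_distance(x, y):
    """1 minus the cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # assumed convention for empty profiles
    return 1.0 - dot / (norm_x * norm_y)


def jaccard_distance(x, y):
    """1 minus (terms in common) / (distinct terms) for two term vectors."""
    terms_x = {m for m, value in enumerate(x) if value}
    terms_y = {m for m, value in enumerate(y) if value}
    union = terms_x | terms_y
    if not union:
        return 1.0  # assumed convention for two empty profiles
    return 1.0 - len(terms_x & terms_y) / len(union)
```

For two binary profiles that share one of three distinct terms, for example `[1, 1, 0, 0]` and `[1, 0, 1, 0]`, the cosine distance is 0.5 and the Jaccard distance is 2/3.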
After selecting a distance measure, the k-means algorithm takes the following steps:
• Randomly selects k nodes and takes them as centroids of the k communities;
• Assigns each node of the IDN to the community of its closest centroid;
• Re-calculates the centroid of each community and repeats the second step;
• Stops when, for a specific number of iterations, the positions of the centroids do not change beyond a certain threshold (stopping criterion).

Table 1 – Semantic clustering of short texts in the literature

| Author | Data-set domain | Goal | Features | Feature values | Distance measure |
|---|---|---|---|---|---|
| Kennedy et al. (2007) | Flickr | Detection of events and places from clusters of social annotation | Terms | TF-IDF | Geographic distance |
| Farahat & Kamel (2010) | Various benchmark data-sets | Document clustering | Terms with stemming | TF-IDF | Variants of corpus-based semantic similarity |
| Antai et al. (2011) | PDF text files | Subject classification and topic detection | Terms | | LSI semantic distance |
| Rangrej et al. (2011) | Twitter | Clustering tweets | Terms | TF-IDF | Cosine similarity and Jaccard coefficient |
| Xu & Orad (2012) | Twitter | Topical clustering and detecting representativeness | Wikipedia terms | TF-IDF | Semantic distance through an external knowledge-base (Wikipedia) |
| Muntean et al. (2012) | Twitter | Semantics behind hashtags | Terms | TF-IDF | Jaccard distance measure |
| Kireyev et al. (2009) | Twitter | Crisis and disaster detection | Domain and event-specific terms | Bag of words | Cosine similarity |
| Avanija & Ramar (2013) | Usenet group dataset | Web documents clustering | Terms | TF-IDF | Semantic distance over an external ontology |
| Suganya & Srinivasan (2013) | Web search queries | To infer query's goal from feedback sessions | Pseudo terms | TF-IDF | Semantic distance over an external knowledge-base (WordNet) |

As seen in Table 1, in unsupervised clustering of tweets most studies use ‘terms’ (unigrams) rather than features such as grammar or syntax.
More sophisticated NLP-type features such as length, n-grams (sequences of n terms), punctuation, and part of speech are mostly common in supervised learning for training classifiers. Also, the table suggests that semantic distance in one of its two forms is commonly used as the measure of similarity.

Having an unsupervised learning problem in hand, an important question to be answered is how to set a value for k. Different measures are offered in the literature to evaluate clustering performance when different values are selected for k. Similarity among objects within each cluster (intra-cluster similarity) and similarity across different clusters (inter-cluster similarity) are the main criteria controlling the performance. The average distance between the centroid of a cluster and the nodes in that cluster (centroid distance) shows the level of similarity among cluster members. This is called intra-cluster similarity (or intra-cluster centroid distance). Clusters with lower intra-cluster distance are more uniform and include more similar members. An average of centroid distances over all nodes of the dataset gives a measure of general intra-cluster similarity. However, alongside high intra-cluster similarity, a good clustering must have low inter-cluster similarity. The two criteria are measured jointly through the Davies-Bouldin (DB) index (Davies & Bouldin, 1979). Without going into the calculation details of this index, it returns lower values for algorithms resulting in clusters with high intra-cluster similarity and low inter-cluster similarity. Therefore, the best k will be associated with the lowest DB index.

To summarize, the k-means algorithm can cluster nodes of the IDN through their descriptions (pseudo-documents d(X)) into k classes based on the similarity between terms used in pseudo-documents. This is a term-level comparison which can be regarded as a first order analysis and matching.
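The k-means steps and the Davies-Bouldin criterion summarized above can be sketched in plain Python as follows. Several simplifications are assumptions of this sketch rather than details from the paper: Euclidean distance is the default, the loop stops at the first iteration in which no centroid moves by more than `tol`, an empty cluster simply keeps its previous centroid, and the DB index follows the standard definition (average over clusters of the worst ratio of summed within-cluster scatter to centroid separation).

```python
import random


def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def k_means(points, k, distance=euclidean, tol=1e-6, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # step 1: random seeds
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        labels = [min(range(k), key=lambda c: distance(p, centroids[c]))
                  for p in points]
        # Step 3: recompute each centroid as the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # assumed handling of empty clusters
        shift = max(distance(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift <= tol:  # step 4: simplified stopping criterion
            break
    return labels, centroids


def davies_bouldin(points, labels, centroids, distance=euclidean):
    """Standard Davies-Bouldin index; lower values indicate better partitions."""
    k = len(centroids)
    scatter = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        scatter.append(sum(distance(p, centroids[c]) for p in members) / len(members)
                       if members else 0.0)
    total = 0.0
    for i in range(k):
        ratios = []
        for j in range(k):
            if j != i:
                gap = distance(centroids[i], centroids[j])
                ratios.append((scatter[i] + scatter[j]) / gap if gap else float("inf"))
        total += max(ratios)
    return total / k
```

Running `k_means` for several candidate values of k and keeping the partition with the lowest `davies_bouldin` value mirrors the selection rule described above. Note that the arithmetic mean is the natural centroid for Euclidean distance; with cosine or Jaccard distance it is only a heuristic.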
A more profound matching among nodes, however, should consider the similarity between ‘concepts’ addressed by users. Many studies have reported a higher level of intra-cluster similarity when semantic distance (rather than a plain cosine distance) is used (see Antai et al. 2011 as an example). This requires highlighting hidden patterns of semantic similarity among pseudo-documents, which can be handled through semantic transformation.

2.3 Semantic transformation

Transformation from the M-dimensional lexical space explained above into a smaller (say q-dimensional) semantic space can, under some conditions, expose the underlying semantic correlations among terms and pseudo-documents. Dimensions of the new space are topics or common semantic themes in documents, and such a transformation is referred to as ‘topic modeling’. One important advantage of this approach is the fact that corpus statistics are the basis for deriving semantic relations. Hidden patterns of semantic similarity are detected from the co-occurrence of terms in multiple documents and therefore the detected similarity will have a context-sensitive nature (Landauer et al. 1998). There are different forms of semantic transformation, among which Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA) are the most popular ones. In the domain of construction, such techniques have recently found some applications in classification tasks. Classifying general conditions for contractual documents (Zhang & El-Gohary, 2013) and classifying environmental regulatory documents (Zhou & El-Gohary, 2014) are among the limited examples of such applications.

Latent Semantic Analysis–LSA (or latent semantic indexing–LSI) is a powerful technique in vectorial semantics, which works by estimating a matrix through Singular Value Decomposition (SVD) and inferring semantic relations among terms and documents.
In terms of data analysis, SVD is a transformation of a dataset from one space, in which the data has high variations, into a new space with lower variations. This transformation separates correlated variables and exposes the level of variation along each dimension in the new space. Therefore, by ignoring variations below a certain threshold, one can reach the best estimation of the original data in a lower-dimensional space. Mathematically, this transformation is the decomposition of a rectangular matrix A into the product of three matrices:

[2] $A \approx \hat{A} = U S V^{T}$

Where: $U$ and $V$ are formed as collections of eigenvectors of matrices $A A^{T}$ and $A^{T} A$, respectively (called singular vectors); $S$ is a diagonal matrix with the singular values of matrix A (square roots of the eigenvalues of $A^{T} A$) as its diagonal entries; and $\hat{A}$ is the least-squares fit of matrix A.

If the full matrix S is used, then $A$ and $\hat{A}$ will be exactly the same. However, selecting a limited number of (say q) dimensions in S will make $\hat{A}$ an estimation of A in a lower dimensional space. As the singular values of matrix A are sorted in descending order on the diagonal of matrix S, the dimensions associated with the q largest singular values are easy to select. The noise will be filtered out in the resulting estimation ($\hat{A}$), hence it will reflect the underlying structure beneath the first order data. The selection of q will influence the results; if it is set too large, then the noise of the first order data will not be fully filtered, and if it is set too small, then some latent correlations will be lost. Reaching an optimal dimensionality for SVD involves some level of trial and error (Antai et al. 2011).

LSA is in fact the result of applying SVD to the term-document matrix. As a result, semantic dependencies and similarities will be highlighted at a deeper level and in a low-dimensional semantic space. The results allow investigation of semantic correlation at three levels: term–term, term–document, and document–document.
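As an illustration of the rank-q estimation described above, the following sketch applies NumPy's `numpy.linalg.svd` to a toy term-document matrix. The helper names `lsa` and `cosine_similarity`, the toy vocabulary, and the choice of scaling document coordinates by the singular values are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np


def lsa(term_doc, q):
    """Rank-q SVD of a term-document matrix.

    Returns (A_hat, doc_coords): the rank-q estimate of the matrix and one
    row of q latent coordinates per document.
    """
    A = np.asarray(term_doc, dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    S_q = np.diag(s[:q])
    A_hat = U[:, :q] @ S_q @ Vt[:q, :]
    doc_coords = (S_q @ Vt[:q, :]).T
    return A_hat, doc_coords


def cosine_similarity(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Toy matrix: rows are terms, columns are three pseudo-documents.
A = np.array([
    [1, 0, 0],  # transit
    [1, 1, 0],  # lrt
    [0, 1, 0],  # ttc
    [0, 0, 1],  # music
    [0, 0, 1],  # beer
], dtype=float)

A_full, _ = lsa(A, 3)    # all singular values kept: reproduces A
_, coords = lsa(A, 2)    # q = 2: the weakest dimension is dropped

raw_sim = cosine_similarity(A[:, 0], A[:, 1])         # first-order similarity
latent_sim = cosine_similarity(coords[0], coords[1])  # similarity in LSA space
```

In this toy case the first two documents share only one of their terms, so their first-order cosine similarity is 0.5, while in the two-dimensional latent space they collapse onto the same direction and the third document stays orthogonal to both.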
The document–document correlation between the pseudo-documents of our problem can be taken as a measure of semantic distance to evaluate the semantic similarity between profile descriptions for nodes of the IDN.

3 CASE STUDY PROJECT

We aim to apply the methods explained above to an actual infrastructure project to see how the main groups of social interests in the urban infrastructure system can be crystalized through analysis of online public participation contents. The ‘Eglinton Crosstown LRT’ line is an under-construction project to span 19km of Light Rail Transit, out of which 10km is tunnelled, cutting across Toronto in the east-west direction along Eglinton Avenue and connecting to the Scarborough Rapid Transit system. The project at a glance can be summarized as:
• Type: Light Rail Transit – New Construction
• Construction start year: 2011 – Estimated completion year: 2020
• Estimated cost: CAD8.4 Billion – Funding: Provincial funds
• Owner: Metrolinx – Ultimate operator: Toronto Transit Commission (TTC)

The Crosstown project is one of the largest transit projects currently underway in North America and it has raised several issues that have gained public and media attention. The owner launched a Twitter account for the project in December 2011. At the time we first collected data (September 2012), it had 521 followers; more than two years later, this number has increased to more than 2800 followers (see footnote 2). We applied k-means clustering through topic modeling to investigate which commonalities form the main groups of these followers and what topics they mainly discuss.

3.1 Followers Typology

Using semantic distances (calculated through LSI) as a measure of similarity, the IDN of the Eglinton Crosstown project on Twitter (September 2012) was clustered by applying LSI to node descriptions and using the outputs as a distance measure for k-means clustering. k (the number of clusters) was selected through distance-based performance measures as illustrated by Table 2.
As Table 2 shows, the lowest DB index (1.425) and the lowest average inter-community distance (0.005) are obtained for k=4.

Table 2 – Evaluation of clustering with different number of clusters (nodes with blank descriptions are excluded)

| k | Cluster | Cluster size | Average centroid distance for each cluster | Average inter-community distance | Performance (Davies-Bouldin index) |
|---|---|---|---|---|---|
| 2 | C1 | 203 | 0.008 | 0.007 | 1.920 |
|   | C2 | 204 | 0.006 | | |
| 3 | C1 | 174 | 0.004 | 0.006 | 1.653 |
|   | C2 | 96 | 0.012 | | |
|   | C3 | 137 | 0.005 | | |
| 4 | C1 | 53 | 0.012 | 0.005 | 1.425 |
|   | C2 | 78 | 0.006 | | |
|   | C3 | 119 | 0.004 | | |
|   | C4 | 157 | 0.004 | | |
| 5 | C1 | 110 | 0.004 | 0.008 | 1.730 |
|   | C2 | 72 | 0.005 | | |
|   | C3 | 84 | 0.004 | | |
|   | C4 | 97 | 0.003 | | |
|   | C5 | 44 | 0.012 | | |
| 6 | C1 | 73 | 0.005 | 0.007 | 1.579 |
|   | C2 | 91 | 0.004 | | |
|   | C3 | 91 | 0.003 | | |
|   | C4 | 45 | 0.012 | | |
|   | C5 | 7 | 0.000 | | |
|   | C6 | 100 | 0.004 | | |

Footnote 2: As of January 2015.

The size of the semantic space (q) was also set equal to 4; this was not only the same as the number of clusters but also the same as the number of communities detected in the same network via community detection (Nik Bakht and El-diraby, 2013b). It must be noted that some Twitter users leave their descriptions blank; these nodes were assigned to a separate cluster.

After clustering an IDN based on the semantic similarity of its nodes' descriptions, semantic analysis helped to detect dominant themes in each cluster. For this purpose, the descriptions for all nodes in one cluster were compiled as one document, and then the top terms in each of these documents were detected through LSI. Table 3 lists the top terms for the four clusters found in the Eglinton Crosstown network through k-means clustering. Results represent outputs of LSA with 95% of total singular values, in conjunction with a ranker system. In this analysis, bi-grams (combinations of two successive words) were also added to the features for latent semantic indexing. Some terms which can help to speculate on dominant topics within each cluster are highlighted in bold in this table.
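Two details of the top-term analysis mentioned above, adding bi-grams to the features and keeping 95% of the total singular values, can be illustrated with a short sketch. Both helpers are hypothetical: joining bi-grams with an underscore follows the look of terms such as Town_hall in Table 3, and reading "95% of total singular values" as a cumulative share of their sum is only one possible interpretation of the paper's wording. The ranker system itself is not described in the paper and is not reproduced here.

```python
def with_bigrams(tokens):
    """Return the unigrams followed by underscore-joined successive pairs."""
    return list(tokens) + [a + "_" + b for a, b in zip(tokens, tokens[1:])]


def rank_for_share(singular_values, share=0.95):
    """Smallest q such that the q largest singular values sum to at least
    `share` of the total (assumed reading of the 95% criterion)."""
    ordered = sorted(singular_values, reverse=True)
    target = share * sum(ordered)
    running = 0.0
    for q, value in enumerate(ordered, start=1):
        running += value
        if running >= target:
            return q
    return len(ordered)
```

For instance, `with_bigrams(["town", "hall", "meeting"])` adds `town_hall` and `hall_meeting` to the three unigrams, and for singular values `[6, 3, 0.6, 0.4]` the first three are needed to reach 95% of their sum.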
Although the highlighted terms can give an overview of the dominating themes in each cluster, the high level of mixing among terms does not allow a firm conclusion to be reached.

Table 3 – Top terms detected through Latent Semantic Indexing in each cluster

| Community | Cluster1 | Cluster2 | Cluster3 | Cluster4 |
|---|---|---|---|---|
| Size | 157 | 119 | 78 | 53 |
| | fan | transit | city | culture |
| | enthusiast | york | news | interested |
| | junkie | lrt | civic | transit |
| | coffee | Mo`m | account | cities |
| | nerd | work | world | housing |
| | political | ttc | awesome | enthusiast |
| | beer | metrolinx | business | photographer |
| | father | updates | mayor | design |
| | music | association | Town_hall | music |
| | views | area | ward | planner |
| | member | village | senior | transport |
| | cyclist | good | twitter | issues |
| | news | enjoy | things | economics |
| | writer | project | active | interests |
| | avid | theatre | citizen | place |
| | addict | construction | resident | fan |
| | designer | leaside | affairs | avid |
| | technology | works | district | baseball |
| | grad | transport | job | human |
| | runner | environment | advisory | pop_culture |
| | resident | chair | progressive | policy |

It should be noted that the authors had formerly shown through social network analysis that this network has four major communities, labeled as: politicians, technical decision makers, city policy makers, and the community of the public (Nik Bakht and El-diraby, 2013a&b). The results of semantic clustering support those findings; dominant terms in cluster 2 refer to the owner and operator of the project, who are among the main technical actors and decision makers. Also, terms such as construction, transport, and project support the speculation that this cluster is composed of a group of people with interests in the technical and engineering aspects of the project. Cluster 3 has top terms which refer to the city councillors and city decision makers. Town hall, mayor, and progressive (referring to the Progressive Conservative Party) are among those terms.
As can be seen, the line separating ‘politicians’ from ‘city policy makers’ is blurred here, as many terms describing these groups belong to the same semantic classes. This was not the case when community detection was performed by looking at the social connectivity among followers. Terms in clusters 1 and 4 suggest that they are composed of nodes from the public with two main themes of interest: art/planning, and politics. It may be hard to make a firm claim about these communities due to the diversity of their top terms; but referring back to the results of bisection through social connectivity and reviewing the major interests of nodes belonging to the community of the public can explain the source of such diversity. Most community leaders in the community of the public were either journalists or planners (Nik Bakht and El-diraby, 2013a).

3.2 Core Interests

Repeating a similar analysis on tweets collected through project-related hashtags or mentions of the project ID can reveal the core themes of interest in online discussions around the project. We repeated the same procedure, including cleaning and tokenizing tweets, and then applying k-means clustering based on semantic similarity among the tweets. We applied the analysis to 170 tweets collected between July and September 2013 through mentions of the project ID (@CrosstownTO) and related hashtags (such as #CrosstownLRT, #Eglinton-Scarborough, etc.). The analysis resulted in five clusters, and we then detected the top terms in each cluster through LSI. Reviewing the list of these top terms suggests some commonalities in each cluster:
• Economy is the dominant theme of one of the clusters. Terms such as XXX (representing monetary values), Dollars, Dollars_Risk, Investing, Millions, Oaarchitects, Construction_Safety, and Mega_Contract are among the top terms of this cluster;
• The second cluster is dominated by terms related to technical features of the project: Construction, Safety, Contract, Consultant, Contractor, Procurement, Procurement_Process, LRT_Project, and Tunnel_Boring are among those terms;
• Community related issues and local/regional aspects are the third group of topics tweeted regarding the project. Terms such as Neighbourhood, Community, Neighbourhood_Revitalization, Eglinton, Yonge_Bus, Finch, Finch_LRT, and Realestate are some of the key terms representing these topics;
• A smaller group, with fewer tweets than the above three classes, uses terms which imply political issues; Councilor, City_Staff, Provincial_Government, and Harper are the most outstanding terms of this type;
• Last but not least, there was a small cluster for which we could not identify a specific theme among its top terms. When tweets in this cluster were screened, we found that they mostly communicate news and updates about the project, its schedule and progress, and public meetings.

4 DISCUSSION AND CONCLUDING REMARKS

The titles assigned to the clusters above may be arguable. For example, when focusing on the sets of terms detected for different classes of interest, one may point to some anomalies; although the terms Construction and Safety are in the class we called ‘Technical issues’, the bigram Construction_Safety is in the cluster of ‘Economy’. The same is true for the term Contract versus the bigram Mega_Contract. Moreover, while Oaarchitects, representing the Ontario Association of Architects (OAA), is expected to be in the technical cluster, it is among the economy related issues. This made us take a closer look at the project and the associated discussions in this period.
In late July 2013, CDAO (Construction & Design Alliance of Ontario), representing about 200 firms, published an online report criticizing Infrastructure Ontario and the project owner (Metrolinx) for the procurement contracting process. The report claims that bundling station and maintenance facility construction into one contract prevents many competitors (particularly local small or medium-sized firms and consortia) from bidding, and thereby kills the competition. They claimed that, as a result, the selected proposal imposes up to $500 million in extra design and construction costs on tax-payers (CDAO, 2013). In a similar report, OAA questioned the feasibility of awarding $1.75 billion of architectural components, in a project worth $4 billion in total, in the form of a single mega-contract (OAA, 2013). These reports also mention other consequences of the bundled contract, including ‘stifling innovation’, accepting bids from a foreign multi-national firm (rather than domestic industries), and concerns regarding public safety due to contracting the project to foreign companies who, according to the reports, lack professional training in this regard. Metrolinx responded to this criticism in August of that year, listing the Canadian companies having roles in the design and construction of the project and in providing the hardware. These reports attracted attention in social media, forming disputes in August 2013. In the data analyzed here, a total of 30 tweets discussed this issue, mainly by emphasizing the economic impacts of such a decision, and that is why the abovementioned terms are found in the same cluster as the terms Dollars, Investing, and monetary values.

In conclusion, although semantic clustering can successfully detect groups of terms used to discuss the same aspect of a project, drawing a conclusion on the topic addressed by those terms is a subjective and judgment-based task.
Using a taxonomy (when a specific context is scoped) and positioning the distribution of detected terms in different branches of the taxonomy may help to automate this phase to some extent. This must be investigated in future research. Even so, the outputs of the presented method are a good starting point for decision makers to understand the composition of the e-society and the content of discussions around the infrastructure project at stake. This can add meaning to the unstructured and chaotic process of online public engagement by bundling participants and their vested interests. The findings of this paper suggest that it is worthwhile to invest more time and effort in this regard. Expanding the domain of the analysis and experimenting with larger datasets can provide more insights.

References

Antai, Roseline, Chris Fox, and Udo Kruschwitz. "The use of latent semantic indexing to cluster documents into their subject areas." Proceedings of the Fifth Language Technology Conference. Springer, 2011. 161-166.
Avanija, K., and K. Ramar. "A hybrid approach using PSO and K-means for semantic clustering of Web." Journal of Web Engineering, 2013: 249-264.
Azhar, Salman, and John M. Abeln. "Investigating social media applications for the construction industry." Creative construction conference CC2014. Prague, Czech Republic: Elsevier, 2014. 42-51.
Bregman, Susan, and Kari Edison Watkins. Best practices for transportation agency use of social media. CRC Press, 2013.
Bregman, Susan, TRB, and TCRPSYNTH. TCRP-Synthesis 99-Use of social media in public transportation. Washington: TRB, 2012.
Bruijn, Hans de, and Ernst ten Heuvelhof. Networks and decision making. Utrecht, Netherlands: Lemma Publishers, 2000.
Davies, D., and W. Bouldin. "A cluster separation measure." IEEE Transactions on Pattern Analysis and Machine Intelligence 1, no. 2 (1979): 224-227.
Farahat, Ahmed, and Mohamed S. Kamel. "Enhancing document clustering using hybrid models for semantic similarity." The eighth workshop on text mining at the SIAM International conference on data mining (SDM). 2010. 83-92.
Kalafatis, Themos. "Twitter analytics: cluster analysis reveals similar twitter users." Life analytics. May 2009. http://lifeanalytics.blogspot.ca/2009/05/twitter-analytics-cluster-analysis.html.
Kennedy, Lyndon, Mor Naaman, Shane Ahern, Rahul Nair, and Tye Rattenbury. "How Flickr helps us make sense of the world: context and content in community contributed." Proceedings of the 15th international conference on Multimedia. ACM, 2007.
Kireyev, Kirill, Leysia Palen, and Kenneth M. Anderson. "Applications of topics models to analysis of disaster-related Twitter data." 2009.
Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. "An introduction to latent semantic analysis." Discourse processes 25 (1998): 259-284.
Muntean, Cristina Ioana, Gabriela Andreea Morar, and Darie Moldovan. "Exploring the meaning behind Twitter Hashtags through clustering." Lecture Notes in Business Information Processing. Springer, 2012. 231-242.
Nik Bakht, Mazdak, and Tamer E. El-diraby. "Analyzing infrastructure discussion networks: order of 'influence' in chaos of 'followers'." CSCE annual conference - 4th construction specialty conference. Montreal: CSCE, 2013a.
Nik Bakht, Mazdak, and Tamer E. El-Diraby. "What does social media say about the infrastructure construction project?" Beijing, China: CIB W78, 2013b.
Nik Bakht, Mazdak, and Tamer El-diraby. "Infrastructure discussion networks: analyzing social media debates of LRT projects in North American cities." TRB 93rd Annual meeting. Washington DC: Transportation Research Board, 2014.
Papadopoulos, Symeon, Yiannis Kompatsiaris, Athena Vakali, and Ploutarchos Spyridonos. "Community detection in social media." J. data mining and knowledge discovery 24, no. 3 (2012): 515-554.
Pitsilis, Georgios, Xiangliang Zhang, and Wei Wang. "Clustering recommenders in collaborative filtering using explicit trust information." Proceedings of: Trust Management V. Copenhagen, Denmark: Springer, 2011. 82-97.
Porter, M. "The Economic Performance of Regions." Regional Studies 37, no. 6-7 (2003): 545–546.
Rangrej, Aniket, Sayali Kulkarni, and Ashish V. Tendulkar. "Comparative study of clustering techniques for short text documents." WWW 2011. Hyderabad: WWW, 2011. 111-112.
Steinhaeuser, Karsten, and Nitesh V. Chawla. "Community detection in a large real-world social network." International conference on social computing, behavioral modeling and prediction. Phoenix, Arizona, USA: Springer, 2008. 168-175.
Suganya, L., and B. Srinivasan. "Efficient semantic similarity based FCM for inferring user search goals with feedback sessions." International journal of computer trends & technology 4, no. 9 (2013): 3316-3321.
Xu, Tan, and Douglas W. Oard. "Wikipedia-based topic clustering for microblogs." Proceedings of the American society for information science & technology 48, no. 1 (2012).
Zhang, Jiansong, and Nora El-gohary. "Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking." Journal of computing in civil engineering, 2013.
Zhou, Peng, and Nora El-gohary. "Semantic-based text classification of environmental regulatory documents for supporting automated environmental compliance checking in construction." Construction Research Congress (CRC-2014). Atlanta, Ga, USA: ASCE, 2014. 897-906.
1 INTRODUCTION – INFRASTRUCTURE DISCUSSION NETWORKS (IDN)

Decision making for the construction and development of civil infrastructure in modern society is a networked procedure. It not only involves a network of interdependent sub-decisions, random variables, and governing criteria, but also requires interactions among a multitude of decision makers (internal stakeholders), each having their own beliefs, goals, and interests in the project. In addition, the external stakeholders (those affected by the decisions) and their vested interests must be considered in decision making, which adds even more to the complexity of the decision network. The final decision in such a networked scenario will take the form of a ‘package-deal’ (Bruijn and Heuvelhof 2000): a sub-optimal solution which accommodates the major interests of the different internal and external stakeholders of the project.

One major goal of community engagement practices in the domain of urban infrastructure is to detect the main groups of external stakeholders along with their interests and to involve them in the network of decision making. Public engagement in infrastructure projects has traditionally been carried out through off-line tools such as community meetings, public hearings, and questionnaires, or through online tools such as Web 1.0 portals and project websites, which are considered one-way communication channels for educating society about the project and its related decisions. On the other hand, as a social-network-related issue, this procedure is influenced, if not re-shaped, by the wind of Web 2.0 (social web).
Many public engagement agencies have recently started trying tools such as weblogs and micro-blogging websites (such as Blogger and Twitter), multimedia sharing websites (including YouTube and Instagram), and other tools for bi-directional communication with the public about infrastructure projects (Bregman and Watkins 2013; Azhar and Abeln 2014). The social web is not only a portal used to comment on design and planning; rather, it is a platform for bringing community members together along with their ideas. Web 2.0–based engagement will hence create a ‘network of people’ together with a ‘network of ideas’. Such networks formed around infrastructure projects are called Infrastructure Discussion Networks – IDNs (Nik Bakht and El-diraby, 2013a&b & 2014).

The formation and evolution of these two interdependent networks (the network of people and the network of ideas) around the physical network of the infrastructure should be closely monitored, and the trends in them must be followed, in order to make the best of users’ ideas and innovations in the design and construction of a sustainable infrastructure. This requires employing tools and methods from network science, information retrieval, and computational linguistics to link people to their ideas and to understand their patterns of connectivity over IDNs. One important outcome of monitoring such patterns is the detection of cores of interest in a project, in conjunction with the people who support them.
Some potential applications of making such a connection between users and ideas over the IDN are:

• Classifying the typology of stakeholders, particularly the end users of the infrastructure system;
• Profiling core interests to be involved in package-deal solutions offered for the infrastructure project;
• Detecting relations and interactions among different community groups;
• Finding possible communication bottlenecks in the process of public consultation;
• Detecting demand in a more direct and pro-active manner;
• Activating ‘user innovation’ by understanding and ranking community-generated inputs.

Achieving such goals requires the detection and analysis of interests for nodes and communities of the IDN. Steinhaeuser & Chawla (2008) suggested assigning interest as a node attribute in the social network and then clustering the network based on the similarity of attributes. In a similar study, Kalafatis (2009) harvested Twitter users based on their similar interests through the occurrence of some pre-defined keywords in their Twitter biography. To include the degree of importance of nodes, Nik Bakht & El-Diraby (2013a) proposed community detection through analysis of the network topology, followed by extracting shared interests within each community through computational linguistics. While segmenting users through the similarity among their attributes is a ‘top-down’ approach, finding communities and then interpreting similarities among them can be considered a ‘bottom-up’ method. Applications such as topic detection in collaborative tagging systems, reviewed by Papadopoulos et al. (2012), and social trust evaluation for recommender systems, addressed by Pitsilis et al. (2011), are among other examples which combine the two approaches to create tools for profiling and labeling groups of followers.
Although the bottom-up approach is helpful for understanding the social construct of the e-society around the infrastructure project, it does not necessarily reveal cores of interest in the project. In order to detect the major lines of interest to be included in the package-deal solution, the main clusters of ideas discussed by project followers must be detected and highlighted. This paper presents a top-down approach to reach this goal. We start by clustering users and ideas based on their semantic similarity (rather than social connectivity). We use data collected from the micro-blogging website Twitter, as the most popular social media used in the domain (Bregman, 2012; Azhar & Abeln, 2014). Focusing on an LRT mega-project in Toronto, Ontario, Canada, we use semantic analysis combined with clustering algorithms to detect the main lines of interest in following this project on Twitter. We apply our method to the IDN of the project and cluster its nodes based on the semantic similarity among their profiles. Then we focus on tweets about the project over a certain period of time and use our proposed method to detect the core themes discussed. Finally, we show how the results can represent the public community's vested interests in practice.

2 METHODOLOGY – CLUSTERING BASED ON SEMANTIC SIMILARITY OF USER PROFILES

Users of online social network websites normally express their interests through their online activities, including the content they share and the comments they leave. However, in many cases they also state their interests directly in user profile descriptions (also known as a biography or ‘bio’ for short). Twitter users can indicate their location, professional affiliations, and interests in a short bio in fewer than 160 characters. Such descriptions are accessible through the Twitter API (Application Programming Interface). One can also use the Twitter API to retrieve the most recent tweets of a user¹.
Such tweets can also indirectly reflect users’ interests. The aim of this paper is to partition the IDN into groups of nodes with similar interests.

We collect the descriptions/tweets of all nodes of an IDN as a set of ‘pseudo-documents’ and form an analysis corpus. Data clustering techniques combined with semantic approaches are then employed to cluster the corpus based upon the similarity between those pseudo-documents. The method can therefore be summarized as: clustering the nodes of an IDN based on semantic similarity among the descriptions collected from their user profiles. This requires cleaning and pre-processing the collected data, modeling it and evaluating the semantic similarity between each pair of data points, and finally clustering the nodes into groups of most similar members.

2.1 Pre-processing and modeling users as vectors

In order to analyze documents, they must first be modeled as data-points (vectors) with specific attributes (dimensions) and attribute values (entries). In text mining and natural language processing (NLP), the dimensions of the analysis vector space (called features in NLP and attributes in data mining) are associated with the terms of the corpus. Texts collected via the Twitter API are normally not in a uniform encoding format and also include some ‘noise’. Therefore, the first step in modeling documents is to tokenize, unify, and clean up the collected data. These tasks are called pre-processing and include the following steps:

Cleaning:
• Removing html tags and attributes which are not visible in a browser;
• Replacing all html character codes (such as &amp;, &quot;, etc.) with their ASCII equivalents;
• Removing all URLs (as most of them are associated with advertisements and commercials);
• Removing Twitter-specific characters such as hash-tags (#) and mention handles (@);
• Replacing monetary values and percentages with a trackable variable (for this purpose we transformed all numbers followed or preceded by a dollar sign to XXX and all numbers followed or preceded by a percentage sign to XX);
• Removing all other numbers (as they are less likely to contribute to semantics).

Tokenizing:
• Splitting texts at white spaces;
• Decomposing clitics and punctuation marks from their hosts;
• Saving ellipses and other forms of multiple punctuation as separate tokens.

All these activities were performed using RegEx (regular expressions). Stemming may also be a part of pre-processing; i.e. all words with the same root could be transformed into their roots. However, this was not done in the present paper due to the aggressive nature of most available stemmers and the confusion caused by stemming. Apart from the noise, common words with no specific semantics (such as conjunctions, articles, gibberish, punctuation, etc.) should be filtered out of the corpus. This is normally done by introducing a ‘stop list’ (a complete list of such words). We started with a compilation of some standard stop lists and added new terms to it as the analyses progressed. As the text processing relies on the occurrence and co-occurrence of terms in multiple documents, terms which appear in one document only were also removed to reduce the dimensions of the analysis space.

A dictionary was formed by collecting all tokens of the cleaned corpus. Each term in this dictionary is one dimension of the analysis space. Project followers were then modeled as vectors in such a vector space. Assuming the dictionary contains M terms in total ($t_1$ to $t_M$), each user X with description d(X), which is called a pseudo-document here and is specified by a set of attributes (terms) $t_i$, was modeled as an M-dimensional vector with non-zero values only in the entries i corresponding to those terms. In the simplest case, $d(X_j)_i$ is 1 if term $t_i$ occurs in the description of node $X_j$, and 0 otherwise. A collection of pseudo-documents (modeled as column vectors d(X)) for all members of an IDN of size N then forms an $M \times N$ matrix, which is equivalent to the ‘term-document’ matrices of topic modeling and the attribute-relation tables of data analysis. Such a matrix takes the following form (rows correspond to terms $t_1, \dots, t_M$ and columns to users $X_1, \dots, X_N$):

$\big[\, d(X_1) \;\; d(X_2) \;\; \cdots \;\; d(X_N) \,\big] = \begin{bmatrix} d(X_1)_1 & d(X_2)_1 & \cdots & d(X_N)_1 \\ d(X_1)_2 & d(X_2)_2 & \cdots & d(X_N)_2 \\ \vdots & \vdots & \ddots & \vdots \\ d(X_1)_M & d(X_2)_M & \cdots & d(X_N)_M \end{bmatrix}$

More advanced versions of such a matrix can also be considered. They include the bag of words, in which entries are non-negative integers reflecting the occurrence frequency of words in pseudo-documents, or the use of TF-IDF (Term Frequency – Inverse Document Frequency) as attribute values, which is a measure describing how representative a term is of a specific user.

¹ Currently, up to 3200 most recent tweets by a user are accessible.

2.2 k-means clustering

By considering each node of the IDN as a data-point and each word of the dictionary as an attribute, the abovementioned term-document matrix collects the attribute values for all data-points. It can accordingly be used to formulate the top-down clustering as an unsupervised clustering problem. The goal of such a problem is to partition database D into k classes C such that the members of each class are most similar to each other and less similar to members of other classes. Global optimality (exhaustively enumerating all possible partitions) and heuristic methods are the two main options for performing such a task. Heuristic search can be either agglomerative or divisive; while the former starts with individual nodes and looks for the most similar ones to be collected in the same group, the latter starts at the network level and tries to find the best cut-offs.
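As a rough illustration of the pre-processing and vector-modeling steps of Section 2.1, the sketch below cleans and tokenizes a few made-up bios and builds the binary and TF-IDF term-document matrices. It is only a minimal sketch under stated assumptions: the regular expressions, the tiny stop list, the case folding, the `log(N / df)` weighting, and all function names (`clean`, `tokenize`, `term_document_matrix`, `tfidf`) are illustrative choices, not the expressions or lists used in the study.

```python
import math
import re
from collections import Counter

# Illustrative stop list only; the paper compiled standard lists and extended them.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on", "at", "by"}

def clean(text):
    """Apply the cleaning steps of Section 2.1 with simple regular expressions."""
    text = text.lower()                                   # assumption: case folding
    text = re.sub(r"<[^>]+>", " ", text)                  # html tags
    for code, char in (("&amp;", "&"), ("&quot;", '"'), ("&lt;", "<"), ("&gt;", ">")):
        text = text.replace(code, char)                   # html character codes
    text = re.sub(r"https?://\S+", " ", text)             # URLs
    text = re.sub(r"[#@]", "", text)                      # hash-tag and mention markers
    text = re.sub(r"\$\s?\d[\d,.]*|\d[\d,.]*\s?\$", " XXX ", text)  # monetary values
    text = re.sub(r"\d[\d,.]*\s?%|%\s?\d[\d,.]*", " XX ", text)     # percentages
    return re.sub(r"\d+", " ", text)                      # all other numbers

def tokenize(text):
    """Split into words, multi-dot ellipses and single punctuation marks, then filter."""
    tokens = re.findall(r"[A-Za-z_]+|\.{2,}|[^\w\s]", clean(text))
    # Keep alphabetic tokens that are not on the stop list.
    return [t for t in tokens if t.replace("_", "").isalpha() and t not in STOP_WORDS]

def term_document_matrix(token_lists):
    """Return (vocabulary, binary M x N matrix, document frequency of each term)."""
    doc_sets = [set(tokens) for tokens in token_lists]
    df = Counter(term for terms in doc_sets for term in terms)
    vocab = sorted(t for t, n in df.items() if n > 1)     # drop terms seen in one document only
    binary = [[1 if t in terms else 0 for terms in doc_sets] for t in vocab]
    return vocab, binary, df

def tfidf(token_lists, vocab, df):
    """TF-IDF variant: term count times log(N / document frequency)."""
    n_docs = len(token_lists)
    counts = [Counter(tokens) for tokens in token_lists]
    return [[c[t] * math.log(n_docs / df[t]) for c in counts] for t in vocab]

# Made-up profile descriptions, one pseudo-document per user.
bios = [
    "Transit fan & coffee junkie. http://t.co/abc",
    "#Transit planner, 100% coffee addict",
    "City news and $5 million budgets",
]
token_lists = [tokenize(b) for b in bios]
vocab, binary, df = term_document_matrix(token_lists)
weights = tfidf(token_lists, vocab, df)
```

Rows of `binary` and `weights` correspond to terms and columns to users, matching the M × N layout described in Section 2.1.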
k-means clustering and k-medoids clustering, or PAM (Partitioning Around Medoids), are two of the best-known heuristic methods for clustering. Many studies in the literature have successfully applied k-means to cluster documents or to detect dominant themes in them. Table 1 lists some recent works on clustering short texts (such as tweets, URLs, and comments).

Clustering needs a measure of similarity before anything else. Different metrics are used in the literature to measure the distance (dissimilarity) among data-points. Minkowski distance (with its two well-known variants, Manhattan distance and Euclidean distance), cosine distance (calculated from the angle between two data-points taken as vectors), and Jaccard distance are among the basic distance metrics. The distance function must be carefully designed for the context of use, using domain knowledge. For semantic clustering purposes, two main groups of distances are used: first, distances calculated through corpus statistics; and second, semantic distances between terms calculated through an external ontology or knowledge-base. The latter is based on the distance between two terms in a taxonomy, and the former is the vector distance between pseudo-documents, calculated either as cosine or as Jaccard distance. In the present problem, these can be calculated as follows:

[1a] Cosine distance: $dist_{cos}(X_i, X_j) = 1 - \dfrac{\sum_{m=1}^{M} d(X_i)_m \, d(X_j)_m}{\lVert d(X_i) \rVert \, \lVert d(X_j) \rVert}$

[1b] Jaccard distance: $dist_{Jac}(X_i, X_j) = 1 - \dfrac{|T_i \cap T_j|}{|T_i \cup T_j|}$

In which: $d(X_i)_m$ is the value of the mth attribute (mth term) for user i in the term-document matrix; $\lVert d(X_i) \rVert$ is the norm of vector $d(X_i)$; $|T_i \cap T_j|$ is the number of terms in common between the profiles of users i and j; and $|T_i \cup T_j|$ is the number of distinct terms in the profiles of users i and j, with $T_i$ denoting the set of terms in the profile of user i.
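The two corpus-based distances can be sketched in a few lines. The snippet below is an illustration only: it assumes each user is given as one column of the term-document matrix (a list of attribute values), takes each distance as one minus the corresponding similarity, and uses hypothetical function names; how empty profiles are treated is likewise an assumption.

```python
import math

def cosine_distance(x, y):
    """1 minus the cosine of the angle between two profile vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0                      # assumption: an empty profile is maximally distant
    return 1.0 - dot / (norm_x * norm_y)

def jaccard_distance(x, y):
    """1 minus (terms in common) / (distinct terms) of two profiles."""
    terms_x = {m for m, value in enumerate(x) if value}
    terms_y = {m for m, value in enumerate(y) if value}
    union = terms_x | terms_y
    if not union:
        return 0.0                      # assumption: two empty profiles are identical
    return 1.0 - len(terms_x & terms_y) / len(union)

# Two users over a four-term dictionary; they share exactly one term.
user_i = [1, 1, 0, 0]
user_j = [1, 0, 1, 0]
d_cos = cosine_distance(user_i, user_j)   # one shared term, norms sqrt(2) each
d_jac = jaccard_distance(user_i, user_j)  # 1 term in common out of 3 distinct terms
```

For these two vectors the cosine distance is 0.5 and the Jaccard distance is 2/3 (up to floating-point rounding), which shows that the two measures generally rank pairs on different scales.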
After selecting a distance measure, the k-means algorithm takes the following steps:

• Randomly selects k nodes and takes them as the centroids of the k communities;
• Assigns each node of the IDN to the community of its closest centroid;
• Re-calculates the centroid of each community and repeats the second step;
• Stops when, for a specific number of iterations, the positions of the centroids do not change beyond a certain threshold (the stopping criterion).

Table 1 – Semantic clustering of short texts in the literature

| Author | Data-set domain | Goal | Features | Feature values | Distance measure |
|---|---|---|---|---|---|
| Kennedy et al. (2007) | Flickr | Detection of events and places from clusters of social annotation | Terms | TF-IDF | Geographic distance |
| Farahat & Kamel (2010) | Various benchmark data-sets | Document clustering | Terms with stemming | TF-IDF | Variants of corpus-based semantic similarity |
| Antai et al. (2011) | Pdf text files | Subject classification and topic detection | Terms | | LSI semantic distance |
| Rangrej et al. (2011) | Twitter | Clustering tweets | Terms | TF-IDF | Cosine similarity and Jaccard coefficient |
| Xu & Oard (2012) | Twitter | Topical clustering and detecting representativeness | Wikipedia terms | TF-IDF | Semantic distance through an external knowledge-base (Wikipedia) |
| Muntean et al. (2012) | Twitter | Semantics behind hashtags | Terms | TF-IDF | Jaccard distance measure |
| Kireyev et al. (2009) | Twitter | Crisis and disaster detection | Domain and event-specific terms | Bag of words | Cosine similarity |
| Avanija & Ramar (2013) | Usenet group dataset | Web documents clustering | Terms | TF-IDF | Semantic distance over an external ontology |
| Suganya & Srinivasan (2013) | Web search queries | To infer query's goal from feedback sessions | Pseudo terms | TF-IDF | Semantic distance over an external knowledge-base (WordNet) |

As seen in Table 1, in unsupervised clustering of tweets most studies use ‘terms’ (unigrams) rather than features such as grammar or syntax.
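The four k-means steps listed before Table 1 can be sketched as follows. This is a minimal sketch, not the implementation used in the study: it assumes Euclidean distance by default (any distance function can be passed in), stops the first time no centroid moves by more than `tol` rather than waiting for several stable iterations, and adds a hypothetical `init` argument so that the starting nodes can be fixed for reproducibility.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, dist=euclidean, init=None, tol=1e-9, max_iter=100, seed=0):
    """Return (labels, centroids) after running the k-means steps listed above."""
    rng = random.Random(seed)
    # Step 1: pick k nodes as the initial centroids (randomly unless `init` is given).
    chosen = list(init) if init is not None else rng.sample(range(len(points)), k)
    centroids = [list(points[i]) for i in chosen]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every node to the community of its closest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Step 3: re-calculate each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])   # keep the centroid of an empty cluster
        # Step 4: stop once no centroid moved by more than the threshold.
        shift = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return labels, centroids

# Toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids = kmeans(points, k=2, init=[0, 4])
```

Because k-means is a heuristic, the outcome depends on the initial centroids; in practice several random starts are usually compared.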
More sophisticated NLP-type features such as length, n-grams (sequences of n terms), punctuation, and part of speech are mostly used in supervised learning for training classifiers. The table also suggests that semantic distance, in one of its two forms, is commonly used as the measure of similarity.

With an unsupervised learning problem in hand, an important question is how to set a value for k. Different measures are offered in the literature to evaluate clustering performance when different values are selected for k. Similarity among objects within each cluster (intra-cluster similarity) and dissimilarity across different clusters (low inter-cluster similarity) are the main criteria controlling the performance. The average distance between the centroid of a cluster and the nodes in that cluster (centroid distance) shows the level of similarity among cluster members; this is called the intra-cluster centroid distance. Clusters with a lower intra-cluster distance are more uniform and include more similar members. An average of centroid distances over all nodes of the dataset gives a measure of general intra-cluster similarity. However, along with high intra-cluster similarity, a good clustering system must have low inter-cluster similarity. Both criteria are captured by the Davies-Bouldin (DB) index (Davies & Bouldin, 1979). Without going into the calculation details of this index, it returns lower values for algorithms resulting in clusters with high intra-cluster similarity and low inter-cluster similarity. Therefore, the best k will be associated with the lowest DB index.

To summarize, the k-means algorithm can cluster the nodes of the IDN through their descriptions (pseudo-documents d(X)) into k classes based on the similarity between the terms used in the pseudo-documents. This is a term-level comparison, which can be regarded as a first-order analysis and matching. However, a more profound matching among nodes should consider the similarity between the ‘concepts’ addressed by users. Many studies have reported a higher level of intra-cluster similarity when semantic distance (rather than plain cosine distance) is used (see Antai et al. 2011 as an example). This requires highlighting hidden patterns of semantic similarity among pseudo-documents, which can be handled through semantic transformation.

2.3 Semantic transformation

Transformation from the M-dimensional lexical space explained above into a lower-dimensional (say q-dimensional) semantic space can, under some conditions, expose the underlying semantic correlations among terms and pseudo-documents. The dimensions of the new space are topics, or common semantic themes in the documents, and such a transformation is referred to as ‘topic modeling’. One important advantage of this approach is the fact that the semantic relations are derived from corpus statistics. Hidden patterns of semantic similarity are detected from the co-occurrence of terms in multiple documents and therefore the detected similarity is context-sensitive (Landauer et al. 1998). There are different forms of semantic transformation, among which Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA) are the most popular. In the domain of construction, such techniques have recently found some applications in classification tasks. Classifying general conditions for contractual documents (Zhang & El-gohary, 2013) and classifying environmental regulatory documents (Zhou & El-gohary, 2014) are among the limited examples of such applications.

Latent Semantic Analysis (LSA), also called Latent Semantic Indexing (LSI), is a powerful technique in vectorial semantics; it works by estimating a matrix through Singular Value Decomposition (SVD) and inferring semantic relations among terms and documents.
In terms of data analysis, SVD is a transformation of a dataset from one space, in which the data has high variations, into a new space with lower variations. This transformation decomposes correlated variables and exposes the level of variation along each dimension of the new space. Therefore, by ignoring variations below a certain threshold, one can reach the best estimation of the original data in a lower-dimensional space. Mathematically, this transformation is the decomposition of a rectangular matrix A into the product of three matrices:

[2] $\hat{A} = U \, S \, V^{T}$

Where: $U$ and $V$ are formed as collections of eigenvectors of the matrices $A A^{T}$ and $A^{T} A$, respectively (called singular vectors); $S$ is a diagonal matrix with the singular values of matrix A (square roots of the eigenvalues of $A^{T} A$) as its diagonal entries; and $\hat{A}$ is the least-squares fit of matrix A.

If the full matrix S is used, then $A$ and $\hat{A}$ will be exactly the same. However, selecting a limited number of (say q) dimensions in S makes $\hat{A}$ an estimation of A in a lower-dimensional space. As the singular values of matrix A are sorted in descending order along the diagonal of matrix S, the dimensions associated with the q largest singular values are easy to select. The noise is filtered out of the resulting estimation ($\hat{A}$), which therefore reflects the underlying structure beneath the first-order data. The selection of q influences the results: if it is set too large, the noise of the first-order data will not be fully filtered, and if it is set too small, some latent correlations will be lost. Reaching an optimal dimensionality for SVD involves some trial and error (Antai et al. 2011).

LSA is in fact the result of applying SVD to the term-document matrix. As a result, semantic dependencies and similarities are highlighted at a deeper level and in a low-dimensional semantic space. The results allow semantic correlation to be investigated at three levels: term–term, term–document, and document–document.
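To make the rank-q estimation of Eq. [2] concrete, the sketch below keeps the q largest singular triplets of a toy term-document matrix and compares documents in the reduced space. It is an illustration under stated assumptions: to stay self-contained, the singular triplets are obtained by power iteration with deflation on $A^{T} A$ instead of a library SVD routine, q must not exceed the rank of A, and the matrix, the terms and all function names are made up for the example.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def truncated_svd(A, q, iters=500):
    """Return the q largest singular triplets (sigma, u, v) of A, largest first."""
    At = [list(col) for col in zip(*A)]
    n_cols = len(A[0])
    triplets = []
    for _ in range(q):
        v = [1.0 / (j + 1) for j in range(n_cols)]       # deterministic start vector
        for _ in range(iters):
            w = matvec(At, matvec(A, v))                 # one power-iteration step on A^T A
            for _, _, v_prev in triplets:                # deflation: stay orthogonal to earlier v's
                proj = dot(w, v_prev)
                w = [wi - proj * vi for wi, vi in zip(w, v_prev)]
            norm = math.sqrt(dot(w, w))
            v = [wi / norm for wi in w]
        Av = matvec(A, v)
        sigma = math.sqrt(dot(Av, Av))
        triplets.append((sigma, [a / sigma for a in Av], v))
    return triplets

def document_coordinates(triplets, n_docs):
    """Coordinates of each document in the q-dimensional semantic space (columns of S V^T)."""
    return [[sigma * v[j] for sigma, _, v in triplets] for j in range(n_docs)]

def cosine_similarity(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

# Toy term-document matrix: rows = terms (transit, lrt, music, beer), columns = 5 users.
A = [
    [1, 1, 0, 0, 0],   # transit
    [0, 1, 1, 0, 0],   # lrt
    [0, 0, 0, 1, 1],   # music
    [0, 0, 0, 1, 0],   # beer
]
triplets = truncated_svd(A, q=2)
docs = document_coordinates(triplets, n_docs=5)
raw_sim = cosine_similarity([row[0] for row in A], [row[2] for row in A])  # users 0 and 2, lexical space
lsa_sim = cosine_similarity(docs[0], docs[2])                             # same users, semantic space
cross_sim = cosine_similarity(docs[0], docs[3])                           # a transit user vs a music user
```

Users 0 and 2 share no term, so their similarity in the lexical space is zero, yet in the two-dimensional semantic space they become almost identical because both co-occur with user 1; users from the transit and music groups remain unrelated.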
The document–document correlation between the pseudo-documents of our problem can be taken as a measure of semantic distance to evaluate the semantic similarity between the profile descriptions of nodes of the IDN.

3 CASE STUDY PROJECT

We aim to apply the methods explained above to an actual infrastructure project to see how the main groups of social interests in the urban infrastructure system can be crystallized through analysis of online public participation contents. The ‘Eglinton Crosstown LRT’ line is an under-construction Light Rail Transit project spanning 19 km, of which 10 km is tunnelled; it cuts across Toronto in the east-west direction along Eglinton Avenue and connects to the Scarborough Rapid Transit system. The project at a glance can be summarized as:
• Type: Light Rail Transit – New Construction
• Construction start year: 2011 – Estimated completion year: 2020
• Estimated cost: CAD 8.4 billion – Funding: Provincial funds
• Owner: Metrolinx – Ultimate operator: Toronto Transit Commission (TTC)

The Crosstown project is one of the largest transit projects currently underway in North America, and it has raised several issues that have gained public and media attention. The owner launched a Twitter account for the project in December 2011. When we first collected data (September 2012), the account had 521 followers; a little over two years later, this number has grown to more than 2800 followers². We applied k-means clustering through topic modeling to investigate which commonalities form the main groups of these followers and what topics they mainly discuss.

3.1 Followers Typology

Using semantic distances (calculated through LSI) as a measure of similarity, the IDN of the Eglinton Crosstown project on Twitter (September 2012) was clustered: LSI was applied to the node descriptions, and its outputs were used as the distance measure for k-means clustering. k (the number of clusters) was selected through distance-based performance measures, as illustrated by Table 2.
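The selection of k through the Davies-Bouldin (DB) index can be sketched as follows. This is a minimal illustration, not the code used in this study: it assumes Euclidean distance between points that already live in a low-dimensional semantic space, uses a hypothetical deterministic farthest-point seeding for k-means, and runs on made-up toy data. All function names are illustrative.

```python
import numpy as np

def pairwise_dist(X, C):
    """Euclidean distance from every row of X to every row of C."""
    return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)

def farthest_point_init(X, k):
    """Deterministic seeding: start at X[0], then repeatedly add the farthest point."""
    centroids = [X[0]]
    for _ in range(1, k):
        nearest = pairwise_dist(X, np.array(centroids)).min(axis=1)
        centroids.append(X[nearest.argmax()])
    return np.array(centroids)

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_point_init(X, k)
    for _ in range(n_iter):
        labels = pairwise_dist(X, centroids).argmin(axis=1)
        updated = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

def davies_bouldin(X, labels, centroids):
    """Davies-Bouldin index: lower means more compact, better separated clusters."""
    k = len(centroids)
    # Average distance of each cluster's members to its own centroid.
    scatter = np.array([np.linalg.norm(X[labels == i] - centroids[i], axis=1).mean()
                        for i in range(k)])
    separation = pairwise_dist(centroids, centroids)
    # For each cluster, the worst (largest) scatter-to-separation ratio.
    worst = [max((scatter[i] + scatter[j]) / separation[i, j]
                 for j in range(k) if j != i) for i in range(k)]
    return float(np.mean(worst))

# Hypothetical data: three compact groups of points in a 2-D semantic space.
offsets = np.array([[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.0, 0.0]])
X = np.vstack([offsets + c for c in ([0, 0], [10, 0], [0, 10])])

scores = {}
for k in range(2, 7):                      # same range of k as evaluated in Table 2
    labels, centroids = kmeans(X, k)
    scores[k] = davies_bouldin(X, labels, centroids)
best_k = min(scores, key=scores.get)       # k with the lowest DB index
```

On this toy data the lowest index is obtained for k = 3, the number of groups that were planted. In practice several random initializations per k would normally be compared; the deterministic seeding here only keeps the sketch reproducible.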
As Table 2 shows, the lowest DB index (the highest intra-cluster similarity) and the lowest average inter-community distance result for k=4.

Table 2 – Evaluation of clustering with different numbers of clusters (nodes with blank descriptions are excluded)

k   Cluster   Cluster size   Average centroid distance   Average inter-community distance   Performance (Davies-Bouldin index)
2   C1        203            0.008                       0.007                              1.920
    C2        204            0.006
3   C1        174            0.004                       0.006                              1.653
    C2        96             0.012
    C3        137            0.005
4   C1        53             0.012                       0.005                              1.425
    C2        78             0.006
    C3        119            0.004
    C4        157            0.004
5   C1        110            0.004                       0.008                              1.730
    C2        72             0.005
    C3        84             0.004
    C4        97             0.003
    C5        44             0.012
6   C1        73             0.005                       0.007                              1.579
    C2        91             0.004
    C3        91             0.003
    C4        45             0.012
    C5        7              0.000
    C6        100            0.004

² As of January 2015

The size of the semantic space (q) was also set to 4; this equals not only the number of clusters but also the number of communities detected in the same network via community detection (Nik Bakht and El-diraby, 2013b). Note that some Twitter users leave their descriptions blank; these nodes were assigned to a separate cluster.

After clustering the IDN based on the semantic similarity of its nodes' descriptions, semantic analysis helped to detect the dominant themes in each cluster. For this purpose, the descriptions of all nodes in one cluster were compiled into one document, and the top terms in each of these documents were then detected through LSI. Table 3 lists the top terms for the four clusters found in the Eglinton Crosstown network through k-means clustering. The results represent outputs of LSA with 95% of total singular values, in conjunction with a ranker system. In this analysis, bi-grams (combinations of two successive words) were also added to the features for latent semantic indexing. Some terms that can help to speculate on the dominant topics within each cluster are highlighted in bold in this table.
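The top-term detection described above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline. It assumes that "95% of total singular values" means keeping the leading singular values whose cumulative sum reaches 95% of the total, and it substitutes a simple hypothetical ranker (sorting terms by their weight in the reconstructed matrix) for the unspecified ranker system. No stop-word removal is performed, and the input strings are made up.

```python
import re
from collections import Counter
import numpy as np

def tokens_with_bigrams(text):
    """Lower-cased word tokens plus bigrams of successive words joined by '_'."""
    words = re.findall(r"[a-z]+", text.lower())
    return words + [f"{a}_{b}" for a, b in zip(words, words[1:])]

def top_terms(cluster_docs, n_top=3, share=0.95):
    """Rank the terms of each compiled cluster document after a truncated SVD."""
    counts = [Counter(tokens_with_bigrams(doc)) for doc in cluster_docs]
    vocab = sorted(set().union(*counts))
    A = np.array([[c[t] for c in counts] for t in vocab], dtype=float)  # terms x docs
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the fewest leading singular values whose sum reaches `share` of the total.
    q = int(np.searchsorted(np.cumsum(s) / s.sum(), share)) + 1
    q = min(q, len(s))
    A_hat = (U[:, :q] * s[:q]) @ Vt[:q, :]
    # Assumed ranker: sort terms by their weight in the reconstructed matrix.
    return [[vocab[i] for i in np.argsort(-A_hat[:, j])[:n_top]]
            for j in range(len(cluster_docs))]

# Hypothetical compiled descriptions, one string per cluster.
cluster_docs = [
    "transit fan and lrt enthusiast transit nerd transit",
    "city hall reporter city news city ward city",
]
ranked = top_terms(cluster_docs)
```

Here `ranked[0]` starts with "transit" and `ranked[1]` with "city", the most frequent term of each compiled document; bigrams such as "city_hall" enter the vocabulary alongside single words.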
Although the highlighted terms in Table 3 can give an overview of the dominant themes in each cluster, the high level of mixing among terms does not allow a firm conclusion to be reached.

Table 3 – Top terms detected through Latent Semantic Indexing in each cluster

Community   Cluster1      Cluster2       Cluster3      Cluster4
Size        157           119            78            53
            fan           transit        city          culture
            enthusiast    york           news          interested
            junkie        lrt            civic         transit
            coffee        Mo`m           account       cities
            nerd          work           world         housing
            political     ttc            awesome       enthusiast
            beer          metrolinx      business      photographer
            father        updates        mayor         design
            music         association    Town_hall     music
            views         area           ward          planner
            member        village        senior        transport
            cyclist       good           twitter       issues
            news          enjoy          things        economics
            writer        project        active        interests
            avid          theatre        citizen       place
            addict        construction   resident      fan
            designer      leaside        affairs       avid
            technology    works          district      baseball
            grad          transport      job           human
            runner        environment    advisory      pop_culture
            resident      chair          progressive   policy

It should be noted that the authors had previously shown through social network analysis that this network has four major communities, labeled as politicians, technical decision makers, city policy makers, and the community of the public (Nik Bakht and El-diraby, 2013a&b). The results of semantic clustering support those findings: the dominant terms in Cluster 2 refer to the owner and operator of the project, who are among the main technical actors and decision makers. Terms such as construction, transport, and project also support the speculation that this cluster is composed of people with interests in the technical and engineering aspects of the project. Cluster 3 has top terms that refer to city councillors and city decision makers; Town hall, mayor, and progressive (referring to the Progressive Conservative party) are among those terms.
As can be seen, the line separating ‘politicians’ from ‘city policy makers’ is blurred here, as many terms describing these groups belong to the same semantic classes. This was not the case when community detection was performed by looking at the social connectivity among followers. Terms in Clusters 1 and 4 suggest that they are composed of nodes from the public with two main themes of interest: art/planning, and politics. It may be hard to make a firm claim about these communities due to the diversity of their top terms; however, referring back to the results of bisection through social connectivity and reviewing the major interests of nodes belonging to the community of the public can explain the source of such diversity. Most community leaders in the community of the public were either journalists or planners (Nik Bakht and El-diraby, 2013a).

3.2 Core Interests

Repeating a similar analysis on tweets collected through project-related hashtags and mentions of the project ID can detect the core themes of interest in online discussions around the project. We repeated the same procedure, including cleaning and tokenizing the tweets and then applying k-means clustering based on semantic similarity among the tweets. We applied the analysis to 170 tweets collected between July and September 2013 through mentions of the project ID (@CrosstownTO) and related hashtags (such as #CrosstownLRT, #Eglinton-Scarborough, etc.). The analysis resulted in five clusters, and we then detected the top terms in each cluster through LSI. Reviewing the list of these top terms suggests some commonalities in each cluster:
• Economy is the dominant theme of one of the clusters. Terms such as XXX (representing monetary values), Dollars, Dollars_Risk, Investing, Millions, Oaarchitects, Construction_Safety, and Mega_Contract are among the top terms of this cluster;
• The second cluster is dominated by terms related to technical features of the project: Construction, Safety, Contract, Consultant, Contractor, Procurement, Procurement_Process, LRT_Project, and Tunnel_Boring are among those terms;
• Community-related issues and local/regional aspects are the third group of topics tweeted regarding the project. Terms such as Neighbourhood, Community, Neighbourhood_Revitalization, Eglinton, Yonge_Bus, Finch, Finch_LRT, and Realestate are some of the key terms representing these topics;
• A smaller group, with fewer tweets than the above three classes, uses terms that imply political issues; Councilor, City_Staff, Provincial_Government, and Harper are the most outstanding terms of this type;
• Last but not least, there was a small cluster for which we could not identify a specific theme among its top terms. When the tweets in this cluster were screened, we found that they mostly communicate news and updates about the project, its schedule and improvement, and public meetings.

4 DISCUSSION AND CONCLUDING REMARKS

The titles assigned to the clusters above may be arguable. For example, when focusing on the sets of terms detected for different classes of interest, one may argue that some anomalies exist: although the terms Construction and Safety are in the class we called ‘Technical issues’, the bigram Construction_Safety is in the ‘Economy’ cluster. The same is true for the term Contract versus the bigram Mega_Contract. Moreover, while Oaarchitects, representing the Ontario Association of Architects (OAA), is expected to be in the technical cluster, it is among the economy-related issues. This made us take a closer look at the project and the associated discussions in this period.
In late July 2013, CDAO (Construction & Design Alliance of Ontario), representing about 200 firms, published an online report criticizing Infrastructure Ontario and the project owner (Metrolinx) for the procurement and contracting process. The report claims that bundling station and maintenance facility construction into one contract prevents many competitors (particularly local small or medium-sized firms and consortia) from bidding, and so kills the competition. It claims that, as a result, the selected proposal imposes up to $500 million of extra design and construction cost on taxpayers (CDAO, 2013). In a similar report, OAA questioned the feasibility of awarding the $1.75 billion of architectural components, in a project worth $4 billion in total, in the form of a single mega-contract (OAA, 2013). These reports also mention other consequences of the bundled contract, including ‘stifling innovation’, accepting bids from a foreign multi-national firm (rather than domestic industries), and concerns regarding public safety due to contracting the project to foreign companies that, according to the reports, lack professional training in this regard. Metrolinx responded to this criticism in August of that year, listing the Canadian companies with roles in the design and construction of the project and in providing the hardware. These reports attracted attention in social media, forming disputes in August 2013. In the data analyzed here, a total of 30 tweets discussed this issue, mainly by emphasizing the economic impacts of such a decision, and that is why the abovementioned terms can be found in the same cluster as the terms Dollars, Investing, and monetary values.

In conclusion, although semantic clustering can successfully detect groups of terms used to discuss the same aspect of a project, drawing a conclusion on the topic addressed by those terms is a subjective and judgment-based task.
Using a taxonomy (when a specific context is scoped) and positioning the distribution of detected terms in different branches of the taxonomy may help to automate this phase to some extent. This must be investigated in future research. Overall, the outputs of the presented method are a good starting point for decision makers to understand the composition of the e-society and the content of discussions around the infrastructure project at stake. This can add meaning to the unstructured and chaotic process of online public engagement by bundling participants and their vested interests. The findings of this paper suggest that it is worthwhile to invest more time and effort in this regard. Expanding the domain of the analysis and experimenting with larger datasets can result in more insights.

References

Muntean, Cristina Ioana, Gabriela Andreea Morar, and Darie Moldovan. "Exploring the meaning behind Twitter Hashtags through clustering." Lecture Notes in Business Information Processing. Springer, 2012. 231-242.
Antai, Roseline, Chris Fox, and Udo Kruschwitz. "The use of latent semantic indexing to cluster documents into their subject areas." Proceedings of the Fifth Language Technology Conference. Springer, 2011. 161-166.
Avanija, K., and K. Ramar. "A hybrid approach using PSO and K-means for semantic clustering of Web." Journal of Web Engineering, 2013: 249-264.
Azhar, Salman, and John M. Ablen. "Investigating social media applications for the construction industry." Creative Construction Conference CC2014. Prague, Czech Republic: Elsevier, 2014. 42-51.
Bregman, Susan, and Kari Edison Watkins. Best practices for transportation agency use of social media. CRC Press, 2013.
Bregman, Susan, TRB, and TCRPSYNTH. TCRP-Synthesis 99-Use of social media in public transportation. Washington: TRB, 2012.
Bruijn, Hans de, and Ernst ten Heuvelhof. Networks and decision making. Utrecht, Netherlands: Lemma Publishers, 2000.
Davies, D. L., and D. W. Bouldin. "A cluster separation measure." IEEE Transactions on Pattern Analysis and Machine Intelligence 1, no. 2 (1979): 224-227.
Farahat, Ahmed, and Mohamed S. Kamel. "Enhancing document clustering using hybrid models for semantic similarity." The eighth workshop on text mining at the SIAM International Conference on Data Mining (SDM). 2010. 83-92.
Kalafatis, Themos. "Twitter analytics: cluster analysis reveals similar twitter users." Life analytics. May 2009. http://lifeanalytics.blogspot.ca/2009/05/twitter-analytics-cluster-analysis.html.
Kennedy, Lyndon, Mor Naaman, Shane Ahern, Rahul Nair, and Tye Rattenbury. "How Flickr helps us make sense of the world: context and content in community contributed." Proceedings of the 15th International Conference on Multimedia. ACM, 2007.
Kireyev, Kirill, Leysia Palen, and Kenneth M. Anderson. "Applications of topics models to analysis of disaster-related Twitter data." 2009.
Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. "An introduction to latent semantic analysis." Discourse Processes 25 (1998): 259-284.
Nik Bakht, Mazdak, and Tamer E. El-diraby. "Analyzing infrastructure discussion networks: order of 'influence' in chaos of 'followers'." CSCE Annual Conference, 4th Construction Specialty Conference. Montreal: CSCE, 2013a.
Nik Bakht, Mazdak, and Tamer E. El-Diraby. "What does social media say about the infrastructure construction project?" Beijing, China: CIB W78, 2013b.
Nik Bakht, Mazdak, and Tamer El-diraby. "Infrastructure discussion networks: analyzing social media debates of LRT projects in North American cities." TRB 93rd Annual Meeting. Washington DC: Transportation Research Board, 2014.
Papadopoulos, Symeon, Yiannis Kompatsiaris, Athena Vakali, and Ploutarchos Spyridonos. "Community detection in social media." J. Data Mining and Knowledge Discovery 24, no. 3 (2012): 515-554.
Pitsilis, Georgios, Xiangliang Zhang, and Wei Wang. "Clustering recommenders in collaborative filtering using explicit trust information." Proceedings of: Trust Management V. Copenhagen, Denmark: Springer, 2011. 82-97.
Porter, M. "The Economic Performance of Regions." Regional Studies 37, no. 6-7 (2003): 545–546.
Rangrej, Aniket, Sayali Kulkarni, and Ashish V. Tendulkar. "Comparative study of clustering techniques for short text documents." WWW 2011. Hyderabad: WWW, 2011. 111-112.
Steinhaeuser, Karsten, and Nitesh V. Chawla. "Community detection in a large real-world social network." International Conference on Social Computing, Behavioral Modeling and Prediction. Phoenix, Arizona, USA: Springer, 2008. 168-175.
Suganya, L., and B. Srinivasan. "Efficient semantic similarity based FCM for inferring user search goals with feedback sessions." International Journal of Computer Trends & Technology 4, no. 9 (2013): 3316-3321.
Xu, Tan, and Douglas W. Oard. "Wikipedia-based topic clustering for microblogs." Proceedings of the American Society for Information Science & Technology 48, no. 1 (2012).
Zhang, Jiansong, and Nora El-gohary. "Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking." Journal of Computing in Civil Engineering, 2013.
Zhou, Peng, and Nora El-gohary. "Semantic-based text classification of environmental regulatory documents for supporting automated environmental compliance checking in construction." Construction Research Congress (CRC-2014). Atlanta, GA, USA: ASCE, 2014. 897-906.

The CSCE International Construction Specialty Conference
UBC, Vancouver, BC, June 2015

Stakeholders of urban infrastructure projects:
Internal (directly involved in the decision making process)
▪ Official decision makers (government departments and private sector)
▪ Technical decision makers (planners, contractors, suppliers, etc.)
External (affected by the decisions made and related operations)
▪ Local community
▪ General public
▪ Project-affected groups
▪ Local neighborhood members
▪ Pressure groups such as NGOs
▪ News media
[Atkin & Skitmore 2008]

[External] Stakeholders Engagement: Public Involvement (PI Programs)
OFFLINE
▪ Public meetings
▪ Open houses
▪ Workshops
▪ Surveys
▪ …
ONLINE
▪ Web 2.0
“Micro-participation” [Evans-Cowley & Griffin 2012]

Micro-blogging (<140 characters)
Created in 2006
645M users in 2015 (~120M active monthly)
135,000 new users joining every day (average)
58 million tweets/day (average) = 9,100/sec
52 out of the 100 strategic infrastructure projects* in NA have had Twitter accounts since 2012!
*Based on the ‘North American Strategic Infrastructure Leadership Forum’

An Infrastructure Discussion Network (IDN): WHO SAYS WHAT?
[Diagram: infrastructure project (social, economic, environmental); people; ideas]

Profiling Communities & Core Interests
▪ Finding groups with common interests (stakeholders typology)
▪ Detecting the main ideas discussed (stakeholders’ vested interests)
[Diagram: infrastructure project]

Method overview
▪ Collecting data: user profile descriptions (under 160 characters); most recent tweets
▪ Data modeling & pre-processing: lexical vector space; terms as dimensions; pseudo-documents as vectors
▪ Semantic distance calculation: semantic transformation (LSI); SVD; semantic distance; latent semantic similarities
▪ Clustering: unsupervised learning; k-means clustering
▪ Interpretation of results: classes of followers (analysis of bios); classes of interest (analysis of tweets)

Eglinton Crosstown LRT (Toronto, ON)
▪ CAD 8.4B project
▪ Connects east and west of Toronto
▪ Major parts underground (10 out of 19 km)
▪ Construction began in 2011
▪ Now under construction

[Figure: follower profile descriptions (bios)]

▪ Each user as a pseudo-document (psd)
▪ A dictionary of all (say M) terms
▪ Each psd as a vector
▪ [Semantic] similarity between vectors
▪ N followers (users): N pseudo-documents

K-means Clustering

Selection of K

Top terms detected through LSI in each cluster (Cluster1 to Cluster4; sizes 157, 119, 78, and 53): same terms as listed in Table 3.

Community detection through SNA (bottom-up)
▪ Network of @crosstownTO followers
▪ Each edge is the notion of a form of friendship/followership: A follows B
▪ Each node shows one project follower
[Nik-Bakht & El-diraby 2012~13]

[Figure. Copyright: Mazdak Nik Bakht, I2C, University of Toronto 2012] [Nik-Bakht & El-diraby, 2012]

C1           C2             C3            C4
university   go             waste         journalist
eglinton     public         comms         cityhall
davisville   area           energy        freelance
indusry      construction   strategist    torontoist
town         hamilton       sustainable   author
updates      dedicated      commuter      civic
air          #ttc           economics     magazine
adventure    gta            ryersonu      write
business     #ttchelps      ceo           neighborhood
tv           #ttcnotices    councillor    columnist
[Nik-Bakht & El-diraby, 2013]

Results of top-down & bottom-up analyses coincide!
[Figure labels: C3 Policy Makers; C1 Political. Copyright: Mazdak Nik Bakht, I2C, University of Toronto 2012]

Core interests
▪ Economy: XXX (representing monetary values), Dollars, Dollars_Risk, Investing, Millions, Oaarchitects, Construction_Safety, Mega_Contract
▪ Technical features: Construction, Safety, Contract, Consultant, Contractor, Procurement, Procurement_Process, LRT_Project, and Tunnel_Boring
▪ Community-related issues, local/regional aspects: Neighbourhood, Community, Neighbourhood_Revitalization, Eglinton, Yonge_Bus, Finch, Finch_LRT, Realestate
▪ Political issues: Councilor, City_Staff, Provincial_Government, Harper
▪ A small cluster with no specific theme. Manual screening: news and updates about the project, its schedule and improvement, and public meetings

Summary
▪ Stakeholders typology: bottom-up (through SNA); top-down (through semantic clustering)
▪ Topic detection: subjectivity of interpretation; existing anomalies
▪ Future work: semantic distance through an external knowledge-base/taxonomy; higher levels of automation(?)

[Diagram: processing pipeline]
▪ Corpus (chaotic): data collection (user descriptions, tweets)
▪ Text processing: pre-processing (tokenizing, noise filtering, feature extraction)
▪ Lexical space (ordered): term statistical analysis
▪ Semantic transformation: LSI
▪ Semantic space (meaningful): topic classification; semantic clustering (k-means clustering)
▪ Analysis: core interests; main concepts discussed

K-means Clustering
