Graph-based Language GroundingbyMohit BajajB.Eng., University of Delhi, 2015A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)August 2019c©Mohit Bajaj, 2019The following individuals certify that they have read, and recommend to theFaculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Graph-based Language Groundingsubmitted by Mohit Bajaj in partial fulfillment of the requirements forthe degree of Master of Sciencein Computer ScienceExamining Committee:Leonid Sigal, Computer ScienceSupervisorJames Little, Computer ScienceSecond ReaderiiAbstractIn recent years, phrase (or more generally language) grounding has emerged as afundamental task in computer vision. Phrase grounding is a generalization of moretraditional computer vision tasks with the goal of localizing a natural languagephrase spatially in a given image. Most recent work use state-of-the-art deep learn-ing techniques to achieve good performance on this task. However, they do notcapture complex dependencies among proposal regions and phrases that are cru-cial for the superior performance on the task. In this work we try to overcomethis limitation through a model that makes no assumptions regarding the underly-ing dependencies in both of the modalities. We present an end-to-end frameworkfor grounding of the phrases in images that uses graphs to formulate more com-plex, non-sequential dependencies among proposal image regions and phrases. Wecapture intra-modal dependencies using a separate graph neural network for eachmodality (visual and lingual), and then use conditional message-passing in anothergraph neural network to fuse their outputs and capture cross-modal relationships.This final representation is used to make the grounding decisions. The frameworksupports many-to-many matching and is able to ground single phrase to multipleimage regions and vice versa. We validate our design choices through a seriesof ablation studies and demonstrate state-of-the-art performance on the Flickr30kEntities dataset and the ReferIt Game dataset.iiiLay SummarySpatial localization of text in images has multiple applications in the field of com-puter vision and can be found at the core of HCI and HRI systems. The task ischallenging because the space of natural language phrases is exponentially large ascompared to other tasks such as object detection and semantic segmentation. Mostprevious work leverages deep networks to achieve good performance but does notcapture complex dependencies among the phrases and image regions that can leadto better results. In this thesis, we propose a model that uses graphs to capturecomplex dependencies among both modalities and achieve superior performanceon this task.ivPrefaceThe entire work presented here is original work done by the author, Mohit Bajaj,in collaboration with Lanjun Wang and under the supervision of Dr. Leonid Sigal.A version of this work has been accepted for publication:• M. Bajaj, L. Wang and L. Sigal. GraphGround: Graph-based LanguageGrounding. In IEEE International Conference on Computer Vision(ICCV),2019Design, implementation and the experiments were done by me. Leonid providedme the feedback during each step and Lanjun helped me in refining the model. Theinitial draft of the paper was written by me, and was later revised by Lanjun andLeonid.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . 41.1.1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.1.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2 Method Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . 62 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1 Multimodal learning . . . . . . . . . . . . . . . . . . . . . . . . 72.1.1 Cross-modal transfer . . . . . . . . . . . . . . . . . . . . 72.1.2 Cross-modal interpretation . . . . . . . . . . . . . . . . . 82.1.3 Joint multimodal processing . . . . . . . . . . . . . . . . 8vi2.2 Phrase Grounding . . . . . . . . . . . . . . . . . . . . . . . . . . 82.3 Graph Neural Networks (GNNs). . . . . . . . . . . . . . . . . . . 103 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.1 Text and Visual Encoders . . . . . . . . . . . . . . . . . . . . . . 133.2 G3RAPHGROUND Network . . . . . . . . . . . . . . . . . . . . . 143.2.1 Phrase Graph . . . . . . . . . . . . . . . . . . . . . . . . 143.2.2 Visual Graph . . . . . . . . . . . . . . . . . . . . . . . . 153.2.3 Fusion Graph . . . . . . . . . . . . . . . . . . . . . . . . 163.2.4 Prediction Network . . . . . . . . . . . . . . . . . . . . . 173.2.5 Post Processing . . . . . . . . . . . . . . . . . . . . . . . 183.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.1 Setup and Inference . . . . . . . . . . . . . . . . . . . . . . . . . 204.2 Datasets and Evaluation . . . . . . . . . . . . . . . . . . . . . . . 204.3 Results and Comparison . . . . . . . . . . . . . . . . . . . . . . 224.4 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . 244.5 Ablation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264.6 Multi-box evaluation . . . . . . . . . . . . . . . . . . . . . . . . 274.7 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . 285 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31viiList of TablesTable 4.1 State-of-the-art comparison on Flickr30k. Phrase grounding ac-curacy on the test set reported in percentages. . . . . . . . . . 21Table 4.2 Phrase grounding accuracy comparison over coarse categorieson Flickr30k dataset. Refer table 4.3 for the category names.The models with (*) as suffix are finetuned. . . . . . . . . . . . 22Table 4.3 Category mappings from ids to names. . . . . . . . . . . . . . 22Table 4.4 State-of-the-art comparison on ReferIt Game. Phrase groundingaccuracy on the test set reported in percentages. . . . . . . . . 23Table 4.5 Ablation results. Flickr30k and ReferIt Game datasets. . . . . . 
26Table 4.6 Box level accuracy on Flickr30k Entities dataset. . . . . . . . . 28Table 4.7 Effect of k on the accuracy of G3RAPHGROUND++. . . . . . . 28viiiList of FiguresFigure 1.1 Illustration of G3RAPHGROUND. Two separate graphs are formedfor the phrases and the image regions respectively, and are thenfused together to make the final grounding predictions. Thecolored bounding-boxes correspond to the phrases in the samecolor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2Figure 3.1 G3RAPHGROUND Architecture. The phrases are encoded intothe phrase graph while image regions are extracted and en-coded into the visual graph. The fusion graph is formed byindependently conditioning the visual graph on each node ofthe phrase graph. The output state of each node of the fusiongraph after message-passing is fed to the prediction networkto get the final grounding decision. . . . . . . . . . . . . . . . 13Figure 4.1 Sample attention results for visual graph. Aggregated attentionover each image region projected in an image. . . . . . . . . . 24Figure 4.2 Sample results obtained by G3RAPHGROUND on Flickr30kEntities dataset. The colored bounding-boxes correspond tothe phrases in same color. . . . . . . . . . . . . . . . . . . . 25Figure 4.3 Sample results obtained by G3RAPHGROUND on ReferIt Gamedataset. The colored bounding-boxes correspond to the phrasesin same color. . . . . . . . . . . . . . . . . . . . . . . . . . . 25ixAcknowledgmentsFirst and foremost, I would like to extend my heartfelt gratitude to my supervisorDr. Leonid Sigal for his support, feedback and patience. He always encouragedme to explore new ideas and was a constant source of motivation to me during mythesis. I will always be grateful to him for accepting me in his research group andagreeing to nurture me into a researcher. I would also like to thank Prof. James J.Little for taking his time out and agreeing to be the second reader of my thesis.Next I would like to thank Lanjun Wang for collaborating with us on thisproject. She kept me motivated and helped me with the ideas through discussions.I would also like to thank my lab-mate Suhail for the discussions and motivatingme to pursue this problem. Thanks should also get to Gursimran, Siddhesh and myother lab-mates for being such great colleagues and being kind and helpful.I would like to thank the Department of Computer Science of University ofBritish Columbia (UBC) for providing this platform and supporting me financiallyas a teaching assistant. I got the wonderful opportunity to learn from some amaz-ing instructors during my coursework. I would like to extend my gratitude to allmy course instructors. This work was supported by Huawei Technologies. A bigthanks to them as well. I am also thankful to Vector Institute for AI for providingme the financial and infrastructure support.Last but not the least, I would forever be grateful and indebted to my familyfor their unconditional love and support. None of this would have been possiblewithout the sacrifices and blessings of my mom. Special thanks to my best friendTanvi for always supporting me and keeping me sane during these years.xChapter 1IntroductionIn last few years, phrase (or more generally language) grounding has emerged asa fundamental task in computer vision. Phrase grounding is a generalization ofmore traditional computer vision tasks, such as object detection [14] and semanticsegmentation [32]. Grounding requires spatial localization of free-form linguisticphrases in images. 
The core challenge is that the space of natural phrases is ex-ponentially large, as compared to, for example, object detection or segmentationwhere the label sets are typically much more limited (e.g., , 80 categories in MSCOCO [22]). This exponential expressivity of the label set necessitates amortizedlearning, which is typically formulated using continuous embeddings of visual andlingual data. Despite challenges, phrase grounding emerged as the core problemin vision due to the breadth of applications that span image captioning [23], visualquestion answering [2, 45] and referential expression recognition [24] (which is atthe core of many HCI and HRI systems).Significant progress has been made on the task in the last couple of years,fueled by large scale datasets (e.g., , Flickr30k [27] and ReferIt Game [17]) andneural architectures of various forms. Most approaches treat the problem as oneof learning an embedding where class-agnostic region proposals [30] or attendedimages [11, 38] are embedded close to the corresponding phrases. A variety ofembedding models, conditional [29] and unconditional [16, 33], have been pro-posed for this task. Recently, the use of contextual relationships among the regionsand phrases has started to be explored and shown to substantially improve the per-1Figure 1.1: Illustration of G3RAPHGROUND. Two separate graphs areformed for the phrases and the image regions respectively, and are thenfused together to make the final grounding predictions. The coloredbounding-boxes correspond to the phrases in the same color.formance. Specifically, [12] and [8] encode the context of previous decisions byprocessing multiple phrases sequentially, and/or contextualizing each decision byconsidering other phrases and regions [12]. A non-differentiable process usingpolicy gradient is utilized in [8], while [12] uses an end-to-end differentiable for-mulation using LSTMs. In both cases, the contextual information is modeled usingsequential propagation (e.g., , using LSTMs [8, 12]).In reality, contextual information in the image, e.g., among the proposed re-gions, can hardly be regarded as sequential. Same can be argued for phrases, par-ticularly in cases where they do not come from an underlying structured source like2a sentence (which is explicitly stated as an assumption and limitation of [12]). Inessence, previous methods impose sequential serialization of fundamentally non-sequential data for convenience. We posit that addressing this limitation explicitlycan lead to both better performance and a more sensibly structured model. Cap-italizing on recent advances in object detection that have addressed conceptuallysimilar limitations with the use of transitive reasoning in graphs (e.g., , using con-volutional graph neural networks [18, 21, 41]), we propose a new graph-basedframework for phrase grounding. To our knowledge, this work is the first to ex-plore graph neural architectures for phrase grounding. Markedly, this formulationallows us to take into account more complex, non-sequential dependencies amongboth proposal image regions and the linguistic phrases that require grounding.Specifically, as illustrated in Figure 1.1, region proposals are first extractedfrom the image and encoded, using a CNN and bounding-box coordinates, intonode features of the visual graph. The phrases are similarly encoded, using a bi-directional RNN, into node features of the phrase graph. 
The strength of connec-tions (edge weights) between the nodes in both graphs are predicted based on thecorresponding node features and the global image/caption context. Gated GraphNeural Networks (GG-NNs) [21] are used to refine the two feature representationsthrough a series of message-passing iterations. The refined representations are thenused to construct the fusion graph for each phrase by fusing the visual graph withthe selected phrase. Again the fused features are refined using message-passing inGG-NN. Finally, the fused features for each node, which correspond to the encod-ing of <phrasei, image region j> tuples, are used to predict the probabilityof grounding phrasei to image region j. These results are further refined bya simple scheme that does non-maxima suppression (NMS), and predicts whethera given phrase should be grounded to one or more regions. The final model, wecall G3RAPHGROUND, is end-to-end differentiable and is shown to produce state-of-the-art results.While we clearly designed our architecture with phrase grounding in mind, wewant to highlight that it is much more general and would be useful for any multi-modal assignment problem where some contextual relations between elements ineach modality exist, for example, text-to-clip [39] / caption-image [19, 44] retrievalor more general cross-modal retrieval and localization [3].3Contributions: We make several contributions through this work. First, we pro-pose a novel graph-based grounding architecture which consists of three connectedsub-networks (visual, phrase and fusion) implemented using Gated Graph NeuralNetworks. Our design is modular and can model rich context both within a givenmodality and across modalities, without making strong assumptions on sequentialnature of data. Second, we show how this architecture can be learned in an end-to-end manner effectively. Third, we propose a simple but very effective refinementscheme that in addition to NMS helps to resolve one-to-many groundings. Finally,we validate our design choices through a series of ablation studies, and illustrateup to 5.33% and 10.21% better than state-of-the-art performance on the Flickr30k[27] and the ReferIt Game [17] datasets.1.1 Problem DefinitionThe problem we are trying to address in this work is grounding free languagephrases in images. More formally, given an image and a phrase or a descriptionalready parsed into multiple phrases, the task is to spatially localize each phrase toa specific region or multiple regions of the image. Each phrase can be groundedto multiple regions and multiple phrases are also allowed to be grounded to thesame region. For instance, the phrase people can be grounded to multiple imageregions where each region corresponds to a single person. Here are some desiredproperties of the model:• It should be able to capture intra-modal dependencies for each modality• It should be able to capture cross-modal dependencies between the text andthe image• It should be able to deal with complex free language descriptions that areparsed into phrases• It should support one-to-many matching for phrases and image-regions andvice versa41.1.1 ScopeWe make some assumptions to limit the scope of the problem. We assume that ei-ther phrases to be grounded exist independently as in ReferIt Game dataset [17] orthey are available as parsed from a given image description/caption as in [27]. 
Sec-ondly, we rely on existing pre-trained region proposal networks to extract image-regions from the given image.1.1.2 DataWe validate our model on the Flickr30k [27] and the Referit Game [17] datasets.Flickr30k contains 31,783 images where each image is annotated with five cap-tions/sentences. Each caption is further parsed into phrases, and the correspondingbounding-box annotations are available. A phrase may be annotated with morethan one ground-truth bounding-box, and a bounding-box may be annotated tomore than one phrase. We use the same dataset split as previous work [27, 29]which use 29,783 images for training, 1000 for validation, and 1000 for testing.Referit Game dataset contains 20,000 images and we use he same split as usedin [15, 29] where we use 10,000 images for training and validation while the other10,000 for testing. Each image is annotated with multiple referring expressions(phrases) and corresponding bounding-boxes. We note that the phrases corre-sponding to a given image of this dataset do not come from a sentence but existindependently.1.2 Method OutlinePhrase grounding is a challenging many-to-many matching problem where a singlephrase can, in general, be grounded to multiple regions, or multiple phrases canbe grounded to a single image region. The G3RAPHGROUND framework usesgraph networks to capture rich intra-modal and cross-modal relationships betweenthe phrases and the image regions. We illustrate the architecture in Figure 3.1.We assume that the phrases are available, e.g., parsed from an image caption (seeFlickr30k [27] dataset) or exist independently for a single image (see ReferIt Game[17] dataset).We encode these phrases using a bi-directional RNN that we call the phrase5encoder. These encodings are then used to initialize the nodes of the phrase graphthat is built to capture the relationships between the phrases. Similarly, we formthe visual graph that models the relationships between the image regions that areextracted from an image using RPN and then encoded using the visual encoder.The caption and the full image provide additional context information that we useto learn the edge-weights for both graphs. Message-passing is independently donefor these graphs to update the respective node features. This allows each phrase/im-age region to be aware of other contextual phrases/image regions. We finally fusethe outputs of these two graphs by instantiating one fusion graph for each phrase.We concatenate the features of all nodes of the visual graph with the feature vectorof a given node of the phrase graph to condition the message-passing in this newfusion graph.The final state of each node of the fusion graph, which corresponds to a pair<phrasei, image region j>, is fed to a fully connected prediction networkto make a binary decision if phrasei should be grounded to image region j.Note that all predictions are implicitly inter-dependent due to series of message-passing iterations in three graphs. We also predict if the phrase should be groundedto a single or multiple regions and use this information for post processing to refineour predictions.1.3 Thesis OrganizationWe have organized this thesis as follows: first, in Chapter 2, we review the relatedworks and the literature relevant to the problem. We discuss different methods andtechniques proposed in the past years to address the problem of spatial localiza-tion of text in images. We also discuss Graph Neural Networks and their recentapplications that motivate our work. 
In Chapter 3, we describe each componentof our model in detail. We also discuss the implementation details and set-up fortraining and inference. In Chapter 4, we discuss the experiments carried out withthis network demonstrating its effectiveness as compared to other works. We alsodiscuss the results of ablation studies that justify our model design choices. Wealso present some qualitative results at the end of this section. Finally in Chapter5, we highlight the contributions of our work and conclude the thesis.6Chapter 2Related WorkOur task of language (phrase) grounding is related to rich literature on multimodallearning; with architectural design building on recent advances in Graph NeuralNetworks. We review the most relevant literature and point the reader to recentsurveys [1, 5] and [37, 46] for added context.2.1 Multimodal learningThe term ‘multimodal’ generally involves different modalities that refer to dif-ferent sensory input such as audio, visual, touch, smell and taste. As discussedin [5], multimodal tasks can be classified based on the information flow betweenmodalities into cross-modal transfer, cross-modal interpretation and joint modalprocessing.2.1.1 Cross-modal transferThis approach assumes that processing occurs in domain specific modules that donot interact with each other. This was inspired by the theory of the modularity ofthe mind (Fodor, 1985). Earlier approaches to multimodal engineering modeledthe flow of information in one modality independent of the information flow inanother one. Finally, the results are aligned to each other at the end. The searchand retrieval task is a classical example where the search query is provided as text toenquire an about an image, video or text file from database [4]. For speech related7tasks, the query needs to be alligned to the audio samples. In such tasks, inputprocessing in one modality is not directly dependent upon the information flowfrom the output modality. Finding appropriate allignments and mappings betweenthe modalities is the challenging and crucial task for this view.2.1.2 Cross-modal interpretationGenerally the goal of this view is to use structured representation of one modality toobtain useful representation of other modality. The concept of attention has becomevery popular as a mediator between two modalities for cross-modal interpretation.Bridwell and Bello [6] argue that it serves as a bottleneck for information flow in acognitive system.For instance, in image captioning [40], attention is used to attend to importantregions of the image to produce the textual description. On the other hand, workslike [20] try to generate visual representations for the summarization of documents.2.1.3 Joint multimodal processingThis view tries to jointly model multiple modalities by trying to combine the infor-mation flow such that interpretation of one modality aids in the interpretation of theothers. For instance, visual question answering (VQA) [25] requires interpretationof the question, interpretation of image features with respect to the question andgeneration of the answer coherent to the question. The main challenge that lies inthis way of modelling is to find an efficient mechanism to combine both modalities.Jointly modelling both text and image is a popular choice for the recent works. Wealso propose an efficient mechanism in this work to combine both modalities.Now we discuss some prior work that focusses on the multimodal task of phrasegrounding.2.2 Phrase GroundingPrior work, such as Karpathy et al. 
[16], draw an inspiration from image retrievalworks that project both modalities into a common regularized subspace. Insteadthey propose to align sentence fragments and image regions in a subspace. Todo this, they introduce a structured max-margin objective that allows their model8at associate sentence fragments to image-regions. They experimentally show thataligning the sub-parts of text and image also helps in image retrieval task. Sim-ilarly, Wang et al. [35] propose a structured matching approach that encouragesthe semantic relations between phrases to agree with the visual relations betweenregions. They formulate this as a discrete optimization problem and relax it to alinear program. This structured matching is integrated with the neural networksthat are trained end-to-end. In [33], Wang et al. propose learning a joint visual-textembedding using two branch neural networks with multiple layers of linear projec-tions followed by non-linearities. They use a large margin objective that combinescross-view ranking constraints with within-view neighborhood structured preser-vation constraints. The idea is further extended in [34] that uses two differentnetwork structures that produce different output representations. The first one isan embedding network that learns an explicit shared latent embedding space witha max-margin loss and neighborhood constraints. The second network is a similar-ity network that fuses two branches via element-wise product and is trained withregression loss. Plummer et al. [29] further build on this idea with a model thatjointly learns multiple text-conditioned embeddings in a single end-to-end model.They propose a concept weight branch that automatically assigns the phrases tothe embeddings instead of pre-defining such assignments. This allows the modelto separate phrases into groups and learn conditional embeddings for these groupsin a single end-to-end model.It has been shown that both textual and visual context information can aidphrase grounding. Plummer et al. [28] perform global inference using a widerange of visual-text constraints from attributes, verbs, prepositions, and pronouns.For instance, relationships between people and clothing or body parts can be usefulin distinguishing individuals. However, these cues are predefined and the weightsfor these cues are learnt. Chen et al. [8] try to leverage the semantic and spatialrelationships between the phrases and corresponding visual regions by proposinga context policy network that accounts for the predictions made for other phraseswhen localizing a given phrase. They also propose and finetune the query-guidedregression network to boost the performance by obtaining better proposals andfeatures. SeqGROUND [12] uses the full image and sentence as the global contextwhile formulating the task as a sequential and contextual process that conditions9the grounding decision of a given phrase on previously grounded phrases of thesentence. They encode image-regions and all phrases into two stacks of LSTMcells, along with already grounded phrase-region pairs into another LSTM whichprovides context for the grounding of the next phrase.We formulate the problem of phrase grounding using Graph Neural Networksthat allow us to jointly model the grounding decision of all the given phrases. Wenow discuss Graph Neural Networks and their recent applications in the field ofmultimodal learning.2.3 Graph Neural Networks (GNNs).Graph Convolution Networks were first introduced in [18] for semi-supervisedclassification. 
Each layer of a GCN performs localized computations involving neighbourhood nodes. These layers can be stacked to form a deeper network that is capable of performing complex computations on graph data. The propagation in the network is based on a first-order approximation of spectral convolutions on graphs. For a two-layer graph network this can be written as

Z = F(X, A) = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)}) W^{(1)}\big),    (2.1)

where \hat{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}, A and D are the adjacency and degree matrices respectively, W^{(0)} is the input-to-hidden weight matrix of the hidden layer, and W^{(1)} is the hidden-to-output weight matrix.

In vision, Yang et al. [42] enhanced GCNs with attention and found them to be effective for scene graph generation. They design a Relation Proposal Network to propose relations between objects and use an attentional Graph Convolution Network to effectively capture contextual information between objects and relations. The work in [36] deploys GCNs to model videos as space-time graphs and obtains impressive results on the video classification task. The graph nodes are obtained from object region proposals extracted from different frames. These nodes are connected by two types of edges: similarity relations that capture the similarity between the nodes, and spatial-temporal relations that capture the interactions between nearby objects. Visual reasoning among image regions for object detection using GCNs was shown in [9] and served as conceptual motivation for our visual graph sub-network.

Recently, [41] presented a theoretical framework for analyzing the expressive power of GNNs to capture different graph structures. They observe that message-passing in GNNs can be described by two functions, AGGREGATE and COMBINE: the AGGREGATE function aggregates the messages from the neighbourhood nodes, and the COMBINE function updates the state of each node by combining the aggregated message with the previous state of that node. They prove that the choice of these functions is crucial to the expressive power of GNNs. Li et al. [21] propose Gated Graph Neural Networks (GG-NNs) that use the Gated Recurrent Units (GRUs) introduced by Cho et al. [10] for the gating in the COMBINE step, which can be described as

z_v(t) = \sigma(W_z \cdot [h_v(t-1), a_v(t)])
r_v(t) = \sigma(W_r \cdot [h_v(t-1), a_v(t)])
\tilde{h}_v(t) = \tanh(W \cdot [r_v(t) * h_v(t-1), a_v(t)])
h_v(t) = (1 - z_v(t)) \cdot h_v(t-1) + z_v(t) \cdot \tilde{h}_v(t)    (2.2)

where h_v(t-1) and h_v(t) are the states of node v before and after the t-th iteration respectively, and a_v(t) is the aggregated message received by the node from its neighborhood during the t-th iteration.

Our model is inspired by these works. We use one GG-NN to model the spatial relationships between the image regions and another to capture the semantic relationships between the phrases. We finally use a third GG-NN to fuse the text and visual embeddings obtained from the corresponding graphs. The output of the fusion network is used to predict whether a given phrase should be grounded to a specific image region or not.
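For concreteness, the following PyTorch sketch illustrates the two propagation rules above: the two-layer GCN of Eq. (2.1) and a single GG-NN iteration in which the COMBINE step of Eq. (2.2) is realized with a standard GRU cell. This is an illustrative rendering, not code from the thesis; class names, layer widths and the use of nn.GRUCell are assumptions.

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """Two-layer GCN of Eq. (2.1): Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w1 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # Symmetrically normalize the adjacency: A_hat = D^(-1/2) A D^(-1/2).
        deg_inv_sqrt = adj.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_hat = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
        h = torch.relu(a_hat @ self.w0(x))
        return torch.softmax(a_hat @ self.w1(h), dim=-1)

class GatedGraphLayer(nn.Module):
    """One GG-NN iteration: linear AGGREGATE over neighbours, GRU COMBINE (Eq. 2.2)."""
    def __init__(self, dim):
        super().__init__()
        self.kernel = nn.Linear(dim, dim, bias=False)  # shared graph kernel W
        self.gru = nn.GRUCell(dim, dim)                # gates z, r, h-tilde of Eq. (2.2)

    def forward(self, h, adj):
        # a_v(t): weighted sum of linearly transformed neighbour states.
        messages = adj @ self.kernel(h)
        # h_v(t): GRU combines the aggregated message with the previous state.
        return self.gru(messages, h)
```

Note that in G3RAPHGROUND the aggregation weights are not a fixed normalized adjacency: they are predicted per edge for the phrase graph and via attention for the visual and fusion graphs, as described in Chapter 3.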
Chapter 3

Approach

Phrase grounding is a challenging many-to-many matching problem where a single phrase can, in general, be grounded to multiple regions, or multiple phrases can be grounded to a single image region. The G3RAPHGROUND framework uses graph networks to capture rich intra-modal and cross-modal relationships between the phrases and the image regions. We illustrate the architecture in Figure 3.1. We assume that the phrases are available, e.g., parsed from an image caption (see the Flickr30k [27] dataset) or existing independently for a single image (see the ReferIt Game [17] dataset).

We encode these phrases using a bi-directional RNN that we call the phrase encoder. These encodings are then used to initialize the nodes of the phrase graph that is built to capture the relationships between the phrases. Similarly, we form the visual graph that models the relationships between the image regions that are extracted from the image using an RPN and then encoded using the visual encoder. The caption and the full image provide additional context information that we use to learn the edge-weights for both graphs. Message-passing is done independently for these graphs to update the respective node features. This allows each phrase/image region to be aware of other contextual phrases/image regions. We finally fuse the outputs of these two graphs by instantiating one fusion graph for each phrase. We concatenate the features of all nodes of the visual graph with the feature vector of a given node of the phrase graph to condition the message-passing in this new fusion graph.

The final state of each node of the fusion graph, which corresponds to a pair <phrase_i, image region_j>, is fed to a fully connected prediction network to make a binary decision on whether phrase_i should be grounded to image region_j. Note that all predictions are implicitly inter-dependent due to the series of message-passing iterations in the three graphs. We also predict whether each phrase should be grounded to a single region or to multiple regions, and use this information for post-processing to refine our predictions.

Figure 3.1: G3RAPHGROUND architecture. The phrases are encoded into the phrase graph while image regions are extracted and encoded into the visual graph. The fusion graph is formed by independently conditioning the visual graph on each node of the phrase graph. The output state of each node of the fusion graph after message-passing is fed to the prediction network to get the final grounding decision.

3.1 Text and Visual Encoders

Phrase Encoder. We assume one or more phrases are available and need to be grounded. Each phrase consists of a word or a sequence of words. We encode each word using its GloVe [26] embedding and then encode the complete phrase using the last hidden state of a bi-directional RNN. Finally, we obtain phrase encodings p_1, ..., p_n for the corresponding n input phrases P_1, ..., P_n.

Caption Encoder. We use another bi-directional RNN to encode the complete input caption C and obtain the caption encoding c_enc. This is useful as it provides global context information missing in the encodings of individual phrases.

Visual Encoder. We use a region proposal network (RPN) to extract region proposals R_1, ..., R_m from an image. Each region proposal R_i is fed to the pre-trained VGG-16 network to extract a 4096-dimensional vector from the first fully-connected layer. We transform this vector into a 300-dimensional vector r_i by passing it through a network with three fully-connected layers with ReLU activations and a batch normalization layer at the end.
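A minimal PyTorch sketch of these two encoders is given below. It is not the thesis implementation: the choice of GRU cells for the bi-directional RNN and the intermediate widths of the projection head are assumptions; only the 300-dimensional GloVe inputs, the 4096-dimensional VGG-16 feature, the three fully-connected layers and the final batch normalization come from the description above.

```python
import torch
import torch.nn as nn

class PhraseEncoder(nn.Module):
    """Encode a phrase (a sequence of 300-d GloVe vectors) with a bi-directional RNN.
    The thesis does not name the recurrent cell; a GRU is assumed here."""
    def __init__(self, word_dim=300, hidden_dim=150):
        super().__init__()
        self.rnn = nn.GRU(word_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, glove_seq):                  # (batch, n_words, 300)
        _, h_n = self.rnn(glove_seq)               # h_n: (2, batch, hidden_dim)
        # Concatenate the last hidden state of each direction -> 300-d phrase code.
        return torch.cat([h_n[0], h_n[1]], dim=-1)

class VisualEncoder(nn.Module):
    """Project a 4096-d VGG-16 fc feature to a 300-d region encoding via three
    fully-connected layers with ReLU activations and a batch norm at the end."""
    def __init__(self, in_dim=4096, out_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),    # intermediate widths are assumptions
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, out_dim), nn.ReLU(),
            nn.BatchNorm1d(out_dim),
        )

    def forward(self, vgg_fc_feat):                # (num_regions, 4096)
        return self.net(vgg_fc_feat)               # (num_regions, 300)
```

The image encoder described next reuses the same projection architecture, applied to the VGG-16 feature of the full image.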
Image Encoder. We use the same architecture as the visual encoder to also encode the full image into a corresponding 300-dimensional vector i_enc that serves as global image context for the grounding network.

3.2 G3RAPHGROUND Network

3.2.1 Phrase Graph

To model relationships between the phrases, we construct the phrase graph G^P, where the nodes of the graph correspond to the phrase encodings and the edges correspond to the context among them. The core idea is to make the grounding decision for each phrase dependent upon the other phrases present in the caption. This provides important context for the grounding of the given phrase. Formally, G^P = (V^P, E^P), where V^P are the nodes corresponding to the phrases and E^P are the edges connecting these nodes. We model this using a Gated Graph Neural Network, where the AGGREGATE step of the message-passing for each node v \in V^P can be described as

a^P_v(t) = \mathrm{AGGREGATE}(\{h^P_u(t-1) : u \in \mathcal{N}(v)\}) = \sum_{u \in \mathcal{N}(v)} A^P_{u,v} \big(W^P_k \cdot h^P_u(t-1)\big),    (3.1)

where a^P_v(t) is the aggregated message received by node v from its neighbourhood \mathcal{N} during the t-th iteration of message-passing, h^P_u(t-1) is the d-dimensional feature vector of phrase-node u before the t-th iteration of message-passing, W^P_k \in \mathbb{R}^{d \times d} is a learnable graph kernel matrix, and A^P_{u,v} corresponds to the scalar entry of the learnable adjacency matrix that denotes the weight of the edge connecting nodes u and v.

We initialize h^P_u(0) with the corresponding phrase encoding p_u \in \mathbb{R}^d produced by the phrase encoder. To obtain each entry of the adjacency matrix A^P_{u,v}, we concatenate the caption embedding (c_enc), the full image embedding (i_enc) and the sum of the corresponding phrase embeddings p_u and p_v. The concatenated feature is passed through a two-layer fully-connected network f_adj followed by a sigmoid:

A^P_{u,v} = A^P_{v,u} = \sigma\big(f_{adj}(\mathrm{Concat}(p_u + p_v, c_{enc}, i_{enc}))\big).    (3.2)

The aggregated message a^P_v(t) received by node v is used to update the state of node v during the t-th iteration:

h^P_v(t) = \mathrm{COMBINE}(\{h^P_v(t-1), a^P_v(t)\}).    (3.3)

We use GRU gating in the COMBINE step, as proposed by [21]. After k (k = 2 for all experiments) stages of message-passing on this graph network, we obtain h^P_v(k), which encodes the final state of the phrase node v \in V^P of the phrase graph. The final state of each node is then used in the fusion step.
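The following sketch shows one such iteration on the phrase graph, i.e., Eqs. (3.1)-(3.3) with GRU gating. It is illustrative rather than the released implementation: the hidden width of f_adj, the reuse of a standard GRUCell, and the inclusion of self-connections are assumptions, while the pairwise sum p_u + p_v, the global caption/image context, and the shared kernel W^P_k follow the equations above.

```python
import torch
import torch.nn as nn

class PhraseGraphLayer(nn.Module):
    """One message-passing iteration of the phrase graph (Eqs. 3.1-3.3)."""
    def __init__(self, d=300, ctx_dim=600):
        super().__init__()
        self.kernel = nn.Linear(d, d, bias=False)            # W_k^P in Eq. (3.1)
        self.f_adj = nn.Sequential(                           # two-layer f_adj of Eq. (3.2)
            nn.Linear(d + ctx_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        self.gru = nn.GRUCell(d, d)                           # GRU COMBINE, Eq. (3.3)

    def edge_weights(self, p, c_enc, i_enc):
        # p: (n, d) phrase encodings; c_enc, i_enc: 1-D global context vectors.
        n = p.size(0)
        pair = p.unsqueeze(0) + p.unsqueeze(1)                # p_u + p_v, (n, n, d), symmetric
        ctx = torch.cat([c_enc, i_enc], dim=-1).expand(n, n, -1)
        return torch.sigmoid(self.f_adj(torch.cat([pair, ctx], dim=-1))).squeeze(-1)

    def forward(self, h, p, c_enc, i_enc):
        adj = self.edge_weights(p, c_enc, i_enc)              # A^P, (n, n)
        agg = adj @ self.kernel(h)                            # AGGREGATE, Eq. (3.1)
        return self.gru(agg, h)                               # COMBINE,   Eq. (3.3)
```

Running this layer for k = 2 iterations yields the refined phrase states h^P(k) used in the fusion step below.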
3.2.2 Visual Graph

Similarly, we instantiate another GG-NN to model the visual graph G^V, which models the spatial relationships between the image regions present in the image. Each node of the graph corresponds to an image region extracted by the RPN. To initialize the states of these nodes, we use the encoded features of the image regions produced by the visual encoder and concatenate them with the position of the corresponding image region in the image, denoted by four normalized coordinates. V^V denotes the nodes of the visual graph G^V. The AGGREGATE step of message-passing on this network for each node v \in V^V can be described as

a^V_v(t) = \sum_{u \in \mathcal{N}(v)} \alpha_u \big(W^V_k \cdot h^V_u(t-1)\big),    (3.4)

where we initialize h^V_u(0) with the vector [r_u, x^{min}_u, y^{min}_u, x^{max}_u, y^{max}_u], obtained by concatenating the visual encoding r_u of the u-th image region with its normalized position, and \alpha_u represents the attention weight given to node u during the message-passing. To obtain \alpha_u, we concatenate the visual encoding r_u of that node with the caption encoding c_enc and the full image encoding i_enc, and then pass this vector through a fully-connected network f_attn followed by a sigmoid:

\alpha_u = \sigma\big(f_{attn}(\mathrm{Concat}(r_u, c_{enc}, i_{enc}))\big).    (3.5)

This is similar to the AGGREGATE step of message-passing on the phrase graph, except that we do not learn the complete adjacency matrix for this graph. We note that it is computationally expensive to learn this matrix, as the number of entries in the adjacency matrix increases quadratically with the number of image regions. Instead, we use unsupervised attention \alpha over the nodes of the visual graph to decide the edge-weights. All edges that originate from node u are weighted by \alpha_u, where \alpha_u \in [0,1].

Similar to the phrase graph, we use the GRU mechanism [21] for the COMBINE step of message-passing on this graph. After k stages of message-passing on this graph network, we obtain h^V_v(k), which encodes the final state of the image-region node v \in V^V of the visual graph. The updated visual graph is conditioned on each node of the phrase graph in the fusion step that we explain next.

3.2.3 Fusion Graph

As we have phrase embeddings and image-region embeddings from the phrase graph and the visual graph respectively, the fusion graph is designed to merge these embeddings before the grounding decisions are made. One fusion graph is instantiated for each phrase by assigning the selected phrase node from the phrase graph to all nodes of the visual graph. We concatenate the features of all nodes of the visual graph with the node features of the selected phrase node from the phrase graph. That is to say, the fusion graph has the following properties: 1) it has the same structure (i.e., the number of nodes as well as the adjacency matrix) as the visual graph; 2) the number of fusion graphs instantiated is the same as the number of nodes in the phrase graph. We can also view this graph as the visual graph conditioned on a node from the phrase graph.

After k iterations of message-passing in the fusion graph, we use the final state of each node to predict the grounding decision for the corresponding image region with respect to the phrase on which that fusion graph was conditioned. This is repeated for all of the phrases by instantiating a new fusion graph from the visual graph for each phrase and conditioning the message-passing in this new graph on the selected phrase node of the phrase graph. Note that while it may seem that message-passing on the fusion graphs occurs independently for each phrase, this is not the case: each phrase embedding that is used to condition message-passing on a fusion graph is an output of the phrase graph and is therefore aware of the other phrases present in the caption.

Let G^F_i denote the fusion graph obtained by conditioning the visual graph on node i of the phrase graph. The initialization of node j in this fusion graph can be described as

h^{F_i}_j(0) = \mathrm{Concat}(h^P_i(k), h^V_j(k)), \quad \forall j \in V^V,    (3.6)

where h^V_j(k) corresponds to the final feature vector of node j in the visual graph and h^P_i(k) is the final feature vector of the selected node i in the phrase graph. The AGGREGATE and COMBINE steps of message-passing on each fusion graph remain the same as described for the visual graph in Eqs. (3.4) and (3.3).
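A minimal sketch of this conditioning is given below. It assumes a fully connected graph and reuses the visual-graph attention weights \alpha for the fusion graph; the text states that the fusion graph shares the visual graph's structure and update steps, but the exact form of its attention weights is not spelled out, so that choice, the feature sizes and the fixed iteration count are assumptions.

```python
import torch
import torch.nn as nn

class FusionGraph(nn.Module):
    """Fusion graph for a single phrase i (Eqs. 3.4 and 3.6): each visual node is
    initialized with [h_i^P(k), h_j^V(k)] and refined with attention-weighted
    aggregation and a GRU combine, as in the visual graph."""
    def __init__(self, phrase_dim=300, visual_dim=304, iters=2):
        super().__init__()
        dim = phrase_dim + visual_dim          # 304 = 300-d region code + 4 coordinates
        self.kernel = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.iters = iters

    def forward(self, h_phrase_i, h_visual, alpha):
        # h_phrase_i: (phrase_dim,)   final state of the selected phrase node i
        # h_visual:   (m, visual_dim) final states of the m visual-graph nodes
        # alpha:      (m,)            attention weights reused from the visual graph
        m = h_visual.size(0)
        h = torch.cat([h_phrase_i.expand(m, -1), h_visual], dim=-1)   # Eq. (3.6)
        for _ in range(self.iters):
            # Eq. (3.4): every node receives the alpha-weighted sum of transformed states.
            msg = alpha @ self.kernel(h)                               # (dim,)
            h = self.gru(msg.expand(m, -1), h)                         # GRU COMBINE
        return h                                                       # fed to f_pred
```

One such module is run per phrase; the per-node outputs h^{F_i}_j(k) are then scored by the prediction network described next.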
3.2.4 Prediction Network

While grounding, we predict a scalar \hat{d}_{ij} for each phrase-region pair that denotes the probability that phrase P_i is grounded to image region R_j. The probability of this decision, conditioned on the given image and caption, can be approximated from the fused embedding of that image region conditioned on the given phrase. We pass the fused embedding of node j of the fusion graph G^F_i through the prediction network f_pred, which consists of three fully-connected layers with interleaved ReLU activations and a sigmoid function at the end:

P(\hat{d}_{ij} = 1 \mid h^{F_i}_j(k)) = \sigma\big(f_{pred}(h^{F_i}_j(k))\big).    (3.7)

3.2.5 Post Processing

Note that a given phrase may be grounded to a single region or to multiple regions. We find that the model's performance can be significantly boosted if we post-process the grounding predictions for the two cases separately. Hence, we predict a scalar \hat{\beta}_v for each phrase v \in V^P which denotes the probability of the phrase being grounded to more than one image region. We pass the updated phrase embedding h^P_v(k) of node v obtained from the phrase graph through a two-layer fully-connected network f_count:

\hat{\beta}_v = \sigma\big(f_{count}(h^P_v(k))\big).    (3.8)

If \hat{\beta}_v is greater than 0.5, we select those image regions for which the output of the prediction network is above a fixed threshold and then apply non-maximum suppression (NMS) as a final step. Otherwise, we simply ground the phrase to the image region with the maximum decision probability output by the prediction network.

3.3 Training

We pre-train the encoders to provide them with a good initialization for end-to-end learning. First, we pre-train the phrase encoder as an autoencoder, and then, keeping it fixed, we pre-train the visual encoder using a ranking loss. The loss enforces that the cosine similarity S_C(\cdot) between the phrase encoding and the visual encoding of a ground-truth pair (p_i, r_j) exceeds that of a contrastive pair by at least the margin \gamma:

L = \sum \Big( \mathbb{E}_{\tilde{p} \neq p_i} \max\{0, \gamma - S_C(p_i, r_j) + S_C(\tilde{p}, r_j)\} + \mathbb{E}_{\tilde{r} \neq r_j} \max\{0, \gamma - S_C(p_i, r_j) + S_C(p_i, \tilde{r})\} \Big),    (3.9)

where \tilde{r} and \tilde{p} denote a randomly sampled contrastive image region and phrase respectively. The caption encoder and the image encoder are pre-trained in a similar fashion. After pre-training the encoders, we jointly train all the modules in an end-to-end fashion.

For end-to-end training, we formulate this as a binary classification task where the model predicts the grounding decision for each phrase-region pair. We minimize the binary cross-entropy loss BCE(\cdot) between the model prediction and the ground-truth label. We also jointly train f_count and apply a binary cross-entropy loss for the binary classification task of predicting whether a phrase should be grounded to a single region or to multiple regions. The total training loss is

L_{train} = \mathrm{BCE}(\hat{d}_{i,j}, d_{i,j}) + \lambda \, \mathrm{BCE}(\hat{\beta}_i, \beta_i),    (3.10)

where \hat{d}_{i,j} and d_{i,j} are the predicted and ground-truth grounding decisions for the i-th phrase and j-th region respectively, \hat{\beta}_i and \beta_i are the prediction and ground truth on whether the i-th phrase is grounded to multiple regions or not, and \lambda is a hyperparameter that is tuned using grid search.
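As a compact reference, the two losses above can be written in PyTorch roughly as follows. This is a sketch: tensor shapes, the sampling of contrastive pairs and the margin value are assumptions; \lambda = 0.1 is the value reported in Section 4.7.

```python
import torch
import torch.nn.functional as F

def grounding_loss(d_hat, d_true, beta_hat, beta_true, lam=0.1):
    """Total training loss of Eq. (3.10): BCE on the per-pair grounding
    probabilities plus lambda-weighted BCE on the multi-region indicator."""
    pair_loss = F.binary_cross_entropy(d_hat, d_true)          # BCE(d_hat_ij, d_ij)
    count_loss = F.binary_cross_entropy(beta_hat, beta_true)   # BCE(beta_hat_i, beta_i)
    return pair_loss + lam * count_loss

def ranking_loss(s_pos, s_neg_phrase, s_neg_region, margin=0.1):
    """Margin-based pre-training loss of Eq. (3.9) for a batch of ground-truth
    pairs: the cosine similarity of (p_i, r_j) must exceed that of contrastive
    pairs by at least the margin. The margin value here is an assumption.
    s_pos, s_neg_phrase, s_neg_region: (batch,) cosine similarities."""
    return (torch.clamp(margin - s_pos + s_neg_phrase, min=0)
            + torch.clamp(margin - s_pos + s_neg_region, min=0)).mean()
```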
Thesecond prediction is for each phrase-region pair, which is to determine the proba-bility of grounding the given phrase to the given image region. Based on the firstprediction, results of the second prediction are accordingly post-processed, and thephrase is grounded to a single or multiple image regions.4.2 Datasets and EvaluationWe validate our model on the Flickr30k [27] and the Referit Game [17] datasets.Flickr30k contains 31,783 images where each image is annotated with five cap-20Method AccuracySMPL [35] 42.08NonlinearSP [33] 43.89GroundeR [31] 47.81MCB [13] 48.69RtP [27] 50.89Similarity Network [34] 51.05IGOP [43] 53.97SPC+PPC [28] 55.49SS+QRN (VGGdet) [8] 55.99CITE [29] 59.27SeqGROUND [12] 61.60CITE [29] (finetuned) 61.89QRC Net [8] (finetuned) 65.14G3RAPHGROUND++ 66.93Table 4.1: State-of-the-art comparison on Flickr30k. Phrase grounding accu-racy on the test set reported in percentages.tions/sentences. Each caption is further parsed into phrases, and the correspondingbounding-box annotations are available. A phrase may be annotated with morethan one ground-truth bounding-box, and a bounding-box may be annotated tomore than one phrase. We use the same dataset split as previous works [27, 29]which use 29,783 images for training, 1000 for validation, and 1000 for testing.The Referit Game dataset contains 20,000 images and we use the same splitas used in [15, 29] where we use 10,000 images for training and validation whileother 10,000 for testing. Each image is annotated with multiple referring expres-sions (phrases) and corresponding bounding-boxes. We note that the phrases cor-responding to a given image of this dataset do not come from a sentence but existindependently.Consistent with the prior work [31], we use grounding accuracy as the eval-uation metric which is the ratio of correctly grounded phrases to total number ofphrases in the test set. If a phrase is grounded to multiple boxes, we first takethe union of the predicted boxes over the image plane. The phrase is correctly21grounded if the predicted region has IoU of more than 0.5 with the ground-truth.4.3 Results and ComparisonMethod C1 C2 C3 C4 C5 C6 C7 C8SMPL [35] 57.89 34.61 15.87 55.98 52.25 23.46 34.22 26.23GroundeR [31] 61.00 38.12 10.33 62.55 68.75 36.42 58.18 29.08RtP [27] 64.73 46.88 17.21 65.83 68.72 37.65 51.39 31.77IGOP [43] 68.71 56.83 19.50 70.07 73.75 39.50 60.38 32.45SPC+PPC [28] 71.69 50.95 25.24 76.23 66.50 35.80 51.51 35.98SeqGROUND [12] 76.02 56.94 26.18 75.56 66.00 39.36 68.69 40.60CITE [29]* 75.95 58.50 30.78 77.03 79.25 48.15 58.78 43.24QRC Net [8]* 76.32 59.58 25.24 80.50 78.25 50.62 67.12 43.60GG++ 78.86 68.34 39.80 81.38 76.58 42.35 68.82 45.08Table 4.2: Phrase grounding accuracy comparison over coarse categories onFlickr30k dataset. Refer table 4.3 for the category names. The modelswith (*) as suffix are finetuned.Category Id Category NameC1 peopleC2 clothingC3 body partsC4 animalsC5 vehiclesC6 instrumentsC7 sceneC8 otherTable 4.3: Category mappings from ids to names.Flickr30k. We test our model on the Flickr30k dataset and report our results in Ta-ble 4.1. Our full model G3RAPHGROUND++ surpasses all other work by achievingthe best accuracy of 66.93%. The model achieves 5.33% increase in the ground-ing accuracy over the state-of-the-art performance of SeqGROUND [12]. Mostmethods, as do we, do not finetune the features on the target dataset. 
Exceptions22Method AccuracySCRC [15] 17.93MCB + Reg + Spatial [7] 26.54GroundeR + Spatial [31] 26.93Similarity Network + Spatial [34] 31.26CGRE [24] 31.85MNN + Reg + Spatial [7] 32.21EB+QRN (VGGcls-SPAT) [8] 32.21CITE [29] 34.13IGOP [43] 34.70QRC Net [8] (finetuned) 44.07G3RAPHGROUND++ 44.91Table 4.4: State-of-the-art comparison on ReferIt Game. Phrase groundingaccuracy on the test set reported in percentages.include CITE [29] and QRC Net [8] designated as (finetuned) in the table. We high-light that comparison to those methods isn’t strictly fair as they use the Flickr30kdataset itself to finetune feature extractors. Despite this, we outperform them by5% and 1.8% respectively, without utilizing specialized feature extractors. Whencompared to the versions of these models (CITE and SS+QRN (VGGdet)) that arenot finetuned, our model outperform them by 7.7% and 10.9% respectively. Thishighlights the power of our contextual reasoning in G3RAPHGROUND. Finetuningof features is likely to lead to additional improvements.Table 4.2 shows the phrase grounding performance of the models for differ-ent coarse categories in Flickr30k dataset. we observe that G3RAPHGROUND++achieves a consistent increase in accuracy compared to other methods in all ofthe categories except for the “instruments”; in fact our model performs best in sixout of the eight categories even when compared with the finetuned methods like[8, 29]. Improvement in the accuracy for “clothing” and “body parts” categories ismore than 8% and 9% respectively.ReferIt Game. We report results of our model on the ReferIt Game dataset in Table4.4. G3RAPHGROUND++ clearly outperforms all other state-of-the-art techniques23Figure 4.1: Sample attention results for visual graph. Aggregated attentionover each image region projected in an image.and achieves the best accuracy of 44.91%. Our model improves the groundingaccuracy by 10.21% over the state-of-the-art IGOP [43] model that uses similarfeatures.4.4 Qualitative ResultsIn Figure 4.1 we visualize the attention (α) on the nodes (image regions) of thevisual graph (image). We find that the model is able to differentiate the importantimage regions from the rest, for example, in (a), the model assigns higher attentionweights to important foreground objects such as child and man than the backgroundobjects like wall and pillar. Similarly in (d), the woman and the car get moreattention than any other region in the image.We visualize some phrase grounding results on the Flickr30k Entities dataset24Figure 4.2: Sample results obtained by G3RAPHGROUND on Flickr30k Enti-ties dataset. The colored bounding-boxes correspond to the phrases insame color.Figure 4.3: Sample results obtained by G3RAPHGROUND on ReferIt Gamedataset. The colored bounding-boxes correspond to the phrases in samecolor.in Figure 4.2. We find that our model is successful in grounding phrases for chal-lenging scenarios. In (e) the model is able to distinguish two women from otherwomen and is also able to infer that colorful clothing corresponds to the dress oftwo women not other women. In (b), (d) and (e) our model is able to ground a single25Method Flickr30k ReferIt GameGG - PhraseG 60.82 38.12GG - VisualG 62.23 38.82GG - FusionG 59.13 36.54GG - VisualG - FusionG 56.32 32.89GG - ImageContext 62.32 40.92GG - CaptionContext 62.73 n.a.GGFusionBase 60.41 38.65G3RAPHGROUND (GG) 63.87 41.79G3RAPHGROUND++ 66.93 44.91Table 4.5: Ablation results. Flickr30k and ReferIt Game datasets.phrase to multiple corresponding bounding-boxes. 
Also note the correct groundingof hand in (h) despite the presence of the other hand candidate. We also point outfew mistakes, for example in (h), blue Bic pen is incorrectly grounded to a braceletwhich is spatially close. In (g), curly hair is grounded to a larger bounding-box.We also visualize phrase grounding results on ReferIt Game dataset in Figure 4.3.4.5 AblationWe conduct ablation studies on our model to clearly understand the benefits of eachcomponent. Table 4.5 shows the results on both datasets. G3RAPHGROUND++ isour full model which achieves the best accuracy. G3RAPHGROUND lacks the sep-arate count prediction branch, and therefore post processes all the predictions ofthe network using the threshold mechanism. The model GG-PhraseG lacks thephrase graph to share information across the phrases, and directly uses the outputof the phrase encoder during the fusion step. In a similar approach, the model GG-VisualG lacks the visual graph, i.e., , there occurs no message-passing among pro-posal image regions. The output of the visual encoder is directly used during the fu-sion. The model GG-FusionG lacks the fusion graph, i.e., , the prediction networkmakes the predictions directly from the output of the visual graph concatenatedwith the output of the phrase graph. GG-VisualG-FusionG is missing both the vi-sual graph and the fusion graph. GG-ImageContext and GG-CaptionContext do26not use the full image and caption embedding respectively in the context informa-tion. We design another strong baseline GGFusionBase for G3RAPHGROUND tovalidate our fusion graph. In this method we do not instantiate one fusion graph oneach phrase for conditional massage-passing, but instead perform fusion throughmessage-passing on a single big graph that consists of the updated nodes of both,the phrase graph and the visual graph, such that each phrase node is connected toeach image region node with an edge of unit weight; no edges between the nodesof the same modality exist.We find that the results show consistent patterns in both of the datasets. Theworse performance of GG-PhraseG and GG-VisualG as compared to the versionG3RAPHGROUND confirms the importance of capturing intra-modal relationships.GG-VisualG-FusionG performs worst for both of the datasets. Even when ei-ther one of the visual graph or the fusion graph is present, accuracy is signifi-cantly boosted. However, the fusion graph is the most critical individual com-ponent of our model as its absence causes the maximum drop in accuracy. GG-FusionBase is slightly better than GG-FusionG but still significantly worse thanG3RAPHGROUND. This is strong proof of the efficacy of our fusion graph. Therole of our post processing technique is also evident from the performance gap be-tween G3RAPHGROUND and G3RAPHGROUND++. Since each ablated model per-forms significantly worse than the combined model, we conclude that each moduleis important.4.6 Multi-box evaluationWe also provide a new (stricter) metric for the box-level accuracy to evaluate multi-box predictions. We call the phrase correctly grounded if it meets two conditions:1) every box in the ground truth for the phrase has an IOU of more than 0.5 withat least one box among those that are matched to the phrase by the model. 2)Every box among those matched to the phrase by the model has an IOU of greaterthan 0.5 with at least one box from the ground truth for the phrase. We report thismetric separately for phrases with single (n = 1) and multiple (n > 1) ground truthannotations in Table 4.6. 
We also report this metric for pickTop1 version of ourmodel that grounds every phrase to only one box with the maximum prediction27score.Method Acc (n = 1) Acc (n > 1) mean AccpickTop1 69.03 4.80 56.12GG 53.17 25.78 48.08GG++ 67.46 25.61 59.07Table 4.6: Box level accuracy on Flickr30k Entities dataset.Dataset k = 1 k = 2 k = 3 k = 5Flickr30k Entities 65.51 66.93 67.21 67.64ReferIt Game 43.31 44.91 45.13 45.79Table 4.7: Effect of k on the accuracy of G3RAPHGROUND++.4.7 HyperparametersThe threshold for post-processing is set to 0.55 and λ to 0.1 using cross-validation.We use the Adam optimizer with learning rate of 0.0002 and decay rate of 0.8every 4 epochs. We also use the weight decay of 0.0005. For the effect of numberof iterations k see Table 4.7.28Chapter 5ConclusionIn this work, we proposed the G3RAPHGROUND framework that deploys GatedGraph Neural Networks to capture intra-modal and cross-modal relationships be-tween the phrases and the image regions to perform the task of language grounding.Image regions are extracted from the images using the pretrained Region ProposalNetwork. G3RAPHGROUND encodes the phrases into the phrase graph and imageregions into the visual graph to finally fuse them into the fusion graph using con-ditional message-passing. This allows the model to jointly make predictions for allphrase-region pairs without making any assumption about the underlying structureof the data. The effectiveness of our approach is demonstrated on two benchmarkdatasets, with up to 10% improvement on state-of-the-art.In the future, we would like to extend this framework for temporal and spatiallocalization of the descriptions of activities in videos. Videos can be modelledas space-time graphs by linking the spatial graphs obtained from the respectiveframes across time. In this scenario, the adjacency matrix would capture spatial andsemantic similarities among the image regions across the frames. This would allowus to capture and visualize interactions among objects across space and time, andhence provide good insight behind the model’s predictions for the task of temporallocalization.Another thread of future work would involve improving the performance ofthe current work. We would like to try multi-kernel approach and see if it furtherimproves the performance. Currently, each graph in the model uses single kernel29for the message-passing step. In future, we would like to use a set of kernels, wherethe choice of kernel for message passing is made with respect to a particular captionusing the the activity information (verb) mentioned in the caption. For instance,“walking” and “running” are more similar in terms of interaction as comparedto “walking” and “holding”, hence it seems more sensible to share a kernel for“walking” and “running” but not for “walking” and “holding”. We can model thisby using a separate branch that learns to choose the kernels based on the contentof the caption. This is inspired by the idea of concept-weight branch used in [29].Besides this, finetuning the Region Proposal Network on the target datasets as donein [8] is likely to improve the performance of our current model.We would also like to see if the current model can be modified and applied suc-cessfully on the inverse task of image and video captioning which would requireincremental graph construction from the output produced by any given time, andthen using message passing and fusion process to generate further output. 
Chapter 5

Conclusion

In this work, we proposed the G3RAPHGROUND framework, which deploys Gated Graph Neural Networks to capture intra-modal and cross-modal relationships between phrases and image regions for the task of language grounding. Image regions are extracted from the images using a pretrained Region Proposal Network. G3RAPHGROUND encodes the phrases into the phrase graph and the image regions into the visual graph, and finally fuses them in the fusion graph using conditional message-passing. This allows the model to jointly make predictions for all phrase-region pairs without making any assumption about the underlying structure of the data. The effectiveness of our approach is demonstrated on two benchmark datasets, with up to 10% improvement over the state-of-the-art.

In the future, we would like to extend this framework to temporal and spatial localization of descriptions of activities in videos. Videos can be modelled as space-time graphs by linking the spatial graphs obtained from the individual frames across time. In this scenario, the adjacency matrix would capture spatial and semantic similarities among image regions across frames. This would allow us to capture and visualize interactions among objects across space and time, and hence provide good insight into the model's predictions for the task of temporal localization.

Another thread of future work involves improving the performance of the current model. We would like to try a multi-kernel approach and see whether it further improves performance. Currently, each graph in the model uses a single kernel for the message-passing step. In the future, we would like to use a set of kernels, where the kernel used for message-passing is chosen with respect to a particular caption based on the activity information (verb) mentioned in the caption. For instance, "walking" and "running" are more similar in terms of interaction than "walking" and "holding", so it seems more sensible to share a kernel between "walking" and "running" but not between "walking" and "holding". We can model this with a separate branch that learns to choose the kernels based on the content of the caption, inspired by the concept-weight branch used in [29]. Besides this, finetuning the Region Proposal Network on the target datasets, as done in [8], is likely to improve the performance of our current model.

We would also like to see whether the current model can be modified and applied successfully to the inverse task of image and video captioning, which would require incremental graph construction from the output produced up to a given time, followed by message-passing and fusion to generate further output. We acknowledge that more experiments are also needed to see whether the current model generalizes well to other localization tasks that involve combinations of modalities other than text and vision, for instance audio and vision.

Lastly, the current model can be extended to handle more than two modalities, which would be useful for tasks such as localization of text in videos that are accompanied by synchronized audio. All three modalities could be modelled as separate graphs and then fused in a similar way to obtain the final predictions. This would require a smarter and more efficient fusion approach to cope with the increase in computation arising from the exponential growth in the number of fusion graphs as the number of modalities increases.

Bibliography

[1] N. Aafaq, S. Z. Gilani, W. Liu, and A. Mian. Video description: A survey of methods, datasets and evaluation metrics. CoRR, abs/1806.00186, 2018. → pages 7

[2] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don't just assume; look and answer: Overcoming priors for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. → pages 1

[3] R. Arandjelovic and A. Zisserman. Objects that sound. In European Conference on Computer Vision (ECCV), 2018. → pages 3

[4] P. K. Atrey, M. A. Hossain, A. El Saddik, and M. S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6):345–379, 2010. → pages 7

[5] L. Beinborn, T. Botschen, and I. Gurevych. Multimodal grounding for language processing. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2325–2339. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/C18-1197. → pages 7

[6] W. Bridewell and P. F. Bello. A theory of attention for cognitive systems. Advances in Cognitive Systems, 4:1–16, 2016. → pages 8

[7] K. Chen, R. Kovvuri, J. Gao, and R. Nevatia. MSRC: Multimodal spatial regression with semantic context for phrase grounding. In ACM International Conference on Multimedia Retrieval, pages 23–31, 2017. → pages 23

[8] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In IEEE International Conference on Computer Vision (ICCV), pages 824–832, 2017. → pages 2, 9, 21, 22, 23, 30

[9] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta. Iterative visual reasoning beyond convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. → pages 11

[10] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. → pages 11

[11] C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu, and M. Tan. Visual grounding via accumulated attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. → pages 1

[12] P. Dogan, L. Sigal, and M. Gross. Neural sequential phrase grounding (SeqGROUND). In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. → pages 2, 3, 9, 21, 22

[13] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016. → pages 21

[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. → pages 1
[15] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4555–4564, 2016. → pages 5, 21, 23

[16] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems (NeurIPS), pages 1889–1897, 2014. → pages 1, 8

[17] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014. → pages 1, 4, 5, 12, 20

[18] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. → pages 3, 10

[19] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. In Transactions of the Association for Computational Linguistics (TACL), 2015. → pages 3

[20] K. Kucher and A. Kerren. Text visualization techniques: Taxonomy, visual survey, and community insights. In 2015 IEEE Pacific Visualization Symposium (PacificVis), pages 117–121. IEEE, 2015. → pages 8

[21] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), 2016. → pages 3, 11, 15, 16

[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014. → pages 1

[23] J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. → pages 1

[24] R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7102–7111, 2017. → pages 1, 23

[25] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In IEEE International Conference on Computer Vision (ICCV), pages 1–9, 2015. → pages 8

[26] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. → pages 13

[27] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In IEEE International Conference on Computer Vision (ICCV), pages 2641–2649, 2015. → pages 1, 4, 5, 12, 20, 21, 22

[28] B. Plummer, A. Mallya, C. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In IEEE International Conference on Computer Vision (ICCV), pages 1928–1937, 2017. → pages 9, 21, 22

[29] B. Plummer, P. Kordas, H. Kiapour, S. Zheng, R. Piramuthu, and S. Lazebnik. Conditional image-text embedding networks. In European Conference on Computer Vision (ECCV), pages 249–264, 2018. → pages 1, 5, 9, 21, 22, 23, 30

[30] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 91–99, 2015. → pages 1, 20

[31] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision (ECCV), pages 817–834, 2016. → pages 21, 22, 23
[32] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(4), 2017. → pages 1

[33] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5005–5013, 2016. → pages 1, 9, 21

[34] L. Wang, Y. Li, J. Huang, and S. Lazebnik. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(2):394–407, 2019. → pages 9, 21, 23

[35] M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. Structured matching for phrase localization. In European Conference on Computer Vision (ECCV), pages 696–711, 2016. → pages 9, 21, 22

[36] X. Wang and A. Gupta. Videos as space-time region graphs. In European Conference on Computer Vision (ECCV), pages 399–417, 2018. → pages 10

[37] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. CoRR, abs/1901.00596, 2019. URL http://arxiv.org/abs/1901.00596. → pages 7

[38] F. Xiao, L. Sigal, and Y. J. Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. → pages 1

[39] H. Xu, K. He, B. Plummer, L. Sigal, S. Sclaroff, and K. Saenko. Multilevel language and vision integration for text-to-clip retrieval. In AAAI Conference on Artificial Intelligence (AAAI), 2019. → pages 3

[40] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), pages 2048–2057, 2015. → pages 8

[41] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR), 2019. → pages 3, 11

[42] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh. Graph R-CNN for scene graph generation. In European Conference on Computer Vision (ECCV), pages 670–685, 2018. → pages 10

[43] R. Yeh, J. Xiong, W.-M. Hwu, M. Do, and A. Schwing. Interpretable and globally optimal prediction for textual grounding using image concepts. In Advances in Neural Information Processing Systems (NeurIPS), pages 1912–1922, 2017. → pages 21, 22, 23, 24

[44] Y. Zhang and H. Lu. Deep cross-modal projection learning for image-text matching. In European Conference on Computer Vision (ECCV), pages 686–701, 2018. → pages 3

[45] Y. Zhang, J. C. Niebles, and A. Soto. Interpretable visual question answering by visual grounding from attention supervision mining. → pages 1

[46] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun. Graph neural networks: A review of methods and applications. CoRR, abs/1812.08434, 2018. URL http://arxiv.org/abs/1812.08434. → pages 7
