UBC Theses and Dissertations

Visual grounding through iterative refinement Fan, Zicong 2020

VISUAL GROUNDING THROUGH ITERATIVE REFINEMENT

by

Zicong Fan

B.Sc., The University of British Columbia, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

June 2020

© Zicong Fan, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

VISUAL GROUNDING THROUGH ITERATIVE REFINEMENT

submitted by Zicong Fan in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Leonid Sigal, Computer Science
Co-supervisor

James J. Little, Computer Science
Co-supervisor

Helge Rhodin, Computer Science
Additional Examiner

Abstract

The problem of visual grounding has attracted much attention in recent years due to its pivotal role in more general visio-lingual high level reasoning tasks (e.g., image captioning, VQA). Despite the tremendous progress in this area, the performance of most approaches has been hindered by the precision of bounding box proposals obtained in the early stages of the recent pipelines. To address this limitation, we propose a general progressive query-guided bounding box refinement architecture (OptiBox) that regresses the output of a visual grounding system closer to the ground truth. We apply this architecture in the context of the GroundeR model and the One-Stage Grounding model. The results from the GroundeR model show that our model can provide an additional grounding accuracy gain for a two-stage grounding system. Further, our experiments show that the proposed model can significantly improve bounding box precision when the predicted box of a grounding system deviates from the ground truth.

Lay Summary

Visual grounding is a task of localizing an object within an image in the form of a bounding box based on a natural language query.
The task is challenging because an enormous number of boxes can be drawn within an image for a given query. Most recent methods consider the results to be sufficient as long as the boxes are near the target objects. We argue that an ideal grounding system should both locate the correct object and establish a tight boundary around the object. In this thesis, we propose a model that leverages the input query to refine the boxes so that the boxes are tight around the target objects while being no less accurate than the original grounding results. In particular, we formulate the refinement process as an iterative procedure where a box is refined progressively until it converges to an optimal placement.

Preface

This thesis presents the original work done by the author, Zicong Fan, under the supervision of Dr. Leonid Sigal and Dr. James J. Little. This project originates from a course project in collaboration with Si Yi Meng to improve the GroundeR [47] model. The author adapted the GroundeR code by Si Yi Meng in the GroundeR experiment in Section 4.3. Except for the adapted code in the aforementioned experiment, the design, implementation, and experiments for this model were done by the author. The author drafted the initial thesis, which was later revised by Dr. Leonid Sigal and Dr. James J. Little.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Related Work
  2.1 Cross-Modal Learning from Vision and Language
  2.2 Visual Grounding
  2.3 Bounding Box Regression
3 Approach
  3.1 One-Step Refinement
  3.2 Multi-Step Refinement
  3.3 Integrating OptiBox into the State of the Art
    3.3.1 GroundeR
    3.3.2 One-Stage Grounding
4 Experiments
  4.1 Dataset
  4.2 Denoising Experiment
    4.2.1 Experiment Description
    4.2.2 Iterative Refinement
    4.2.3 Visual and Language Backbones
    4.2.4 Changes in IoU Distribution
    4.2.5 Qualitative Analysis
  4.3 OptiBox with the State-of-the-Art Models
    4.3.1 Grounding Accuracy Performance
    4.3.2 IoU Distribution Analysis
  4.4 Discussion and Limitations
  4.5 Implementation Details
5 Conclusion
6 Future Work
Bibliography

List of Tables

Table 4.1 Accuracy of one-step and multi-step OptiBox when trained in different noise levels and different number of refinement steps. Note that zero-step corresponds to the accuracy before doing refinement.
Table 4.2 Accuracy of a multi-step OptiBox using different visual backbones in different noise level settings.
Table 4.3 Accuracy of an eight-step OptiBox using different language encoders in different noise level settings.
Table 4.4 Performance comparison of the state-of-the-art models on the Flickr30k Entities dataset.
Table 4.5 Comparison of proposal upper bounds of various state-of-the-art models with our models. In the QRN network, RPN, SS and PGN are different proposal methods.

List of Figures

Figure 1.1 Two visual grounding examples
Figure 1.2 Left: the white box represents the raw prediction of a grounding model (GroundeR [47] in this case) for the phrase A child. Right: our model (OptiBox) makes appropriate adjustments using query-guided regression, resulting in a much more precise box on the right. The blue box represents the ground truth.
Figure 2.1 Vision and language research categories.
Figure 3.1 OptiBox: Our proposed bounding box refinement model. Given an input bounding box, an image, and a phrase query, we encode the image and query using an image encoder and a language encoder. We extract visual features using RoIAlign according to the input box region to obtain a feature vector x. The query features q are then concatenated with the visual feature vector for refinement. We project the concatenated vector using feedforward layers. The final projection yields a 4-dimensional bounding box adjustment. The numbers on the layers indicate the corresponding layer output dimensions.
Figure 3.2 Multi-step OptiBox. The one-step OptiBox model is applied iteratively k times to adjust the bounding box. During the k steps of refinement, an ideal multi-step model would gradually move the input bounding box (shown in orange) towards the ground truth box (shown in blue).
Figure 3.3 An illustration of the GroundeR [47] model. Weight-shared linear layers are in the same color.
Figure 3.4 An illustration of the One-Stage Grounding [57] model. The figure is taken from the original paper.
Figure 4.1 One example from the Flickr30k Entities [38] dataset
Figure 4.2 Accuracy at each refinement step of an eight-step OptiBox model when trained in different noise levels.
Figure 4.3 Contour plot of IoU changes for lower noise levels
Figure 4.4 Contour plot of IoU changes for higher noise levels
Figure 4.5 Positive qualitative examples of OptiBox refinement.
Figure 4.6 A comparison of different IoU distributions
Figure 4.7 Contour plot of IoU changes before and after the refinement by a One-Step OptiBox model for the GroundeR [47] model and the One-Stage Grounding [57] model. The percentages of median IoU improvement (non-zero IoUs) for the two models are 10.83% and 0.15%
Figure 4.8 Three main cases where OptiBox fails.

Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) through the Canada Graduate Scholarships-Master’s (CGS-M) Program.

I would like to express my gratitude toward my supervisors Dr. Leonid Sigal and Dr. James J. Little. Their professionalism has always been an inspiration to me. I appreciate their support even in difficult times during my research and the exceptional academic freedom that they provided to me.
I would like to thank my past supervisor Dr. Ian Mitchell for encouraging me in research during my undergraduate and early graduate studies, and for inspiring me to pursue academic research. I acknowledge my colleagues Shih-Han Chou, Bicheng Xu, and Mir Rayat Imtiaz Hossain for helpful feedback and discussions.

Last but not least, I thank my friends and family members for their support, especially my parents for their patience and unconditional love. Special thanks to Eileen Zeng for her encouragement and wonderful cookies during the hard times, and Jocelyn Minns for the countless walks to the ice cream parlour.

Chapter 1

Introduction

Visual grounding is the task of associating a textual query with corresponding regions in a given image. Given a query, an ideal visual grounding system is expected to localize an entity within the image according to the query in the form of a bounding box. The convention of most methods in the area is that if a phrase corresponds to multiple regions in an image, the union of the regions is the target. Figure 1.1 shows two examples of grounding different queries. Given the phrase “a fish”, an ideal grounding model should localize the fish by returning a bounding box surrounding it. For the phrase “person”, since there are two “person” entities in the scene, the query should correspond to the bounding box containing both persons.

Figure 1.1: Two visual grounding examples

The visual grounding problem has attracted much attention in recent years as it plays a vital role in applications such as image captioning [24] and visual question answering (VQA) [61]. Most methods in this field follow a two-stage process [9, 22, 34, 39, 47] consisting of an object proposal stage that suggests potential bounding boxes a phrase query could ground to, and a decision stage that assigns one or more proposed boxes to a query.
Despite various efforts to improve visual grounding systems, their performance is limited by the precision (i.e., the overlap between a box and its ground truth) of bounding box proposals in the first stage. That is, if most bounding box proposals have low overlap with the ground truth (low precision, the left of Fig. 1.2 for instance), a model will struggle in the second stage. We say that a query is correctly grounded if it is assigned a proposal box that is close enough in size and location to the ground truth annotation box.

Although the grounding performance is limited by the precision of bounding box proposals, few methods in visual grounding attempt to address this issue by improving the overlap between bounding box proposals and their ground truth regions. Chen et al. [6] introduce a bounding box regression network that leverages reinforcement learning techniques to guide the training of the network. However, their model is not expressive enough for bounding box refinement. Although Yang et al. [57] propose to incorporate query information in the region proposal network to predict bounding boxes that are more correlated with the query, they do not refine the predicted boxes to improve bounding box precision.

In fact, since visual grounding accuracy (the fraction of grounded regions close to the ground truth bounding boxes) is the most commonly used metric for evaluating visual grounding systems, research in this area tends to neglect the precision of the predicted bounding boxes. In this thesis, we would like to go one step further by considering both visual grounding accuracy and bounding box precision. To this end, we introduce a bounding box optimization network called OptiBox that can be used to improve bounding box precision of existing visual grounding models.
In a nutshell, OptiBox utilizes the textual query to refine the predicted proposal boxes progressively and ensures that the box is tight around the object of interest (Figure 1.2 right).

We conduct a controlled experiment to demonstrate a few desirable properties of our proposed model. Further, to evaluate how our proposed model can benefit existing visual grounding systems, we incorporate OptiBox into two published models: the GroundeR [47] model and the One-Stage Grounding [57] model. We choose the GroundeR model [47] because it has a simple structure and it has been well-studied in the literature. We choose the One-Stage Grounding model [57] to illustrate the benefit of OptiBox towards the current state of the art.

Figure 1.2: Left: the white box represents the raw prediction of a grounding model (GroundeR [47] in this case) for the phrase A child. Right: our model (OptiBox) makes appropriate adjustments using query-guided regression, resulting in a much more precise box on the right. The blue box represents the ground truth.

Our contributions are as follows: (1) We propose a general progressive query-guided bounding box refinement architecture (OptiBox) that dramatically improves the precision of bounding boxes in the context of visual grounding; (2) We conduct various experiments to showcase the desirable qualities of OptiBox; (3) We incorporate this architecture into two published visual grounding models (the GroundeR model [47] and the One-Stage Grounding model [57]) to demonstrate the benefit of OptiBox for state-of-the-art models.

Chapter 2

Related Work

Visual grounding is a cornerstone of machine intelligence. Because the problem is highly related to Vision and Language research, we provide a brief overview of machine learning across the two modalities.
Since our focus is to improve visual grounding precision by a refinement process, for added context, we provide a recent history of the visual grounding and bounding box regression methods.

2.1 Cross-Modal Learning from Vision and Language

Over the last few years, the deep learning community has begun to blur the line between the vision and the language domains due to numerous applications of cross-modal vision and language research [35]. Most vision and language research falls into one of the following categories (also see Fig. 2.1):

Vision to Language (V2L): This category of methods focuses on mapping from the vision to the language domain, which includes captioning of images [1, 58] and videos [11, 36]. For example, Anderson et al. [1] introduced a bottom-up and top-down attention approach for Image Captioning. They obtain a set of visual features from a Faster R-CNN [44] backbone, which are used by an LSTM decoder [20] to generate captions by attending to different visual features.

Language to Vision (L2V): L2V methods, on the other hand, address problems of the inverse mapping, including tasks such as image [43, 56, 60] or video [3, 29] generation from text. An example of image generation is the AttnGAN model, proposed by Xu et al. [56]. The model generates images from captions in a multi-stage process. In particular, at each image generation stage, they use an attention mechanism to focus on different regions of the image generated in the previous stage, and on different words in the caption.

Figure 2.1: Vision and language research categories.
[Panel labels: V2L (Image/Video Captioning), L2V (Text to Image/Video Generation), VL2L (Visual Question Answering), VL2V (Referring Expression).]

Vision-Language to Language (VL2L): In this class of methods, the goal is to facilitate language understanding and reasoning via the interaction between the two modalities. Visual Question Answering (VQA) [2, 4, 10, 12, 52] belongs to this category.
For example, Gao et al. [12] proposed the Dynamic Fusion with Intra- and Inter-modality Attention Flow (DFAF) module. The model begins by extracting visual features and language features from image regions and words of the question. The visual features and the language features are then refined by a sequence of DFAF modules. Each module allows the two modalities to interact by weighting features from each of the two modalities using features from the other modality. In particular, the weighting process is achieved through a co-attention between the image regions and the words of a question. After the weighting, each modality is independently refined via self-attention. The output features from DFAF are then used to predict a probability distribution over possible answers.

Vision-Language to Vision (VL2V): In the VL2V branch, the two modalities are used for solving problems in visual understanding. For example, the Referring Expression (or Visual Grounding) problem [25, 38] aims to find entities within an image based on a natural language query. Most methods in the literature such as [9, 34, 47] extract image regions through a region proposal network (RPN) and obtain language features from the input query. Visual features of the image regions are then merged with the language features from the input query by a fusion module. The features from the fusion module are then mapped to scalars to represent the compatibility between the query and each region, and the most compatible region is returned. In Section 2.2 we will dive into the details of visual grounding.

Recent Trend: Most models in the vision and language literature focus on designing single-task architectures. Recently, there has been a trend of vision and language research in the multi-task learning setting [7, 32, 33, 51].
Analogous to pre-training a visual backbone in the vision domain, these methods pre-train a visual-linguistic backbone using self-supervised objectives on a corpus of image and caption pairs (such as the Conceptual Captions dataset [49]). Once the pre-training is complete, the backbone is adapted to different vision and language tasks and is trained jointly with multiple task objectives. For example, Su et al. [51] trained and evaluated their visual-linguistic backbone on the Visual Commonsense Reasoning [59], Visual Question Answering, and Referring Expression tasks. Recently, Lu et al. [33] extended the training to 12 vision and language tasks, which shows that training with a multi-task objective can significantly reduce overfitting.

One example of the visual-linguistic backbone is the ViLBERT model proposed by Lu et al. [32]. They extended the Transformer model [8, 53] in the language domain to the vision and language domain by introducing the Co-Transformer module, which performs refinement on visual features (of image regions) and language features (of the captions) by conditioning on the other modality. In the pre-training stage, [32] pre-train the visual-linguistic backbone on the Conceptual Captions dataset by predicting the class probability distribution of a bounding box from a region proposal network (RPN) based on visual and linguistic context such as the visual features from other bounding boxes and the language features from the image caption. Another objective they introduced was to predict whether a text describes an image, by considering the original image-caption pair from the corpus a positive example and treating the pairs from the same image but with other captions as negative examples.

2.2 Visual Grounding

The Flickr30k Entities dataset was introduced by Plummer et al. [38] to set a benchmark for the visual grounding of phrases.
The objective of the grounding task is to localize an image region based on a natural language query in the form of a phrase. Most methods in the literature [9, 34, 47] rely on a region proposal network (RPN) to obtain candidate regions for the query to ground to. A compatibility score is computed for the query and each region. Although deep learning had started to gain popularity at the time of [38], the benchmark model was developed based on Canonical Correlation Analysis (CCA) [21]. The idea is to learn linear mappings to a common space from the vision and language modalities by maximizing the correlation between the two modalities in that space. Under this formulation, a phrase is grounded to a region if the region has the minimum cosine distance to the phrase in the learned common space. Later, Wang et al. [54] focused on a metric learning approach while Hu et al. [22] started to incorporate deep learning architectures into a phrase generation pipeline. In particular, Wang et al. [54] added non-linearities in the common space projection functions and introduced the bidirectional cross-view ranking constraint and the within-view structure-preserving constraint. The former constraint encourages the distance between a phrase and its ground truth region pair in the annotation to be smaller than other possible combinations across the dataset. The latter constraint specifies that the features of similar sentences should be clustered together in the common space. The visual features of regions have a similar dual requirement. Hu et al. [22] used a text generation deep learning architecture that utilizes both spatial and visual features. Specifically, they introduced the Spatial Context Recurrent ConvNet (SCRC) model, which consists of a local LSTM stream and a global LSTM stream. The two streams jointly help to predict the probability of a query by conditioning on the local visual features of an image region, the global visual features of the image, and the location of the region.
The authors decomposed the joint probability of a query using a product of conditional probabilities of each word given previous words. Later, Rohrbach et al. [47] proposed the Grounding by Reconstruction (GroundeR) model that directly scores the candidate regions based on the query without a text generation architecture (like in [22]). The GroundeR model can be trained under different levels of supervision.

Up to this point, most methods in the literature do not address the limitation of candidate proposals. That is, if most proposals for a given query are not close to the target, it would be hard to ground a query to the correct region. Several methods have attempted to overcome this bottleneck. Chen et al. [6] proposed the Query-guided Regression network with Context policy (QRC Net). The QRC Net consists of three parts: a Proposal Generation Network (PGN) for candidate proposals, a Query-guided Regression Network (QRN) for performing query-guided bounding box regression on the proposals, and a Context Policy Network (CPN) for guiding the training of the regression network so that it does not regress to irrelevant regions. In contrast to our method, the QRC Net only performs bounding box regression once, and our ablation studies show that more refinement steps can provide a further grounding accuracy gain. The concurrent research from Yang et al. [57] and Sadhu et al. [48] tackles the proposal limitation with a different approach. Specifically, they fuse language features from the query into the image feature map to directly propose query-related bounding boxes. In particular, [57] adapts the YOLO [41, 42] architecture to directly predict the anchor offsets and the compatibility of the query to the anchored region.

Although in this thesis we do not assume the input phrases are associated with a sentence, some methods leverage the context in the sentence.
The Context Policy Network in the aforementioned QRC Net [6] guides the training of the regression network by considering all phrases in a sentence to avoid regressing to irrelevant regions. Dogan et al. [9] proposed a network consisting of two stacks of LSTM cells to make all grounding decisions jointly. The phrase LSTM stack processes the language vector for each phrase according to their order in the sentence. The box LSTM stack processes the visual features of boxes according to the relative spatial positions of boxes. After the queries and boxes are encoded, a history LSTM stack grounds each phrase in a recurrent manner, which allows the grounding of a phrase to be conditioned on the grounding of previous phrases. Bajaj et al. [34] extended the idea of making grounding decisions jointly by proposing a grounding model based on a graph neural network structure, which relaxes the order assumption between phrases and bounding boxes in [9]. Specifically, they proposed a vision graph and a language graph. The vision graph considers the visual features of each bounding box a node in the graph while the language graph considers the language features of each phrase a node. Assuming the nodes are fully connected, the two graphs are refined through attention. A fusion graph is used to merge all combinations of nodes between the two graphs to ground all phrases jointly.

2.3 Bounding Box Regression

Bounding box regression has been commonly used in object detection to improve the precision of boxes associated with detected objects. Many object detection methods [14, 15, 17, 31, 41, 42, 44] adopt a ridge regression model to refine boxes from a region proposal network (RPN). For example, Lin et al. [31] introduced a YOLO-like one-stage object detector [41, 42] which performs bounding box regression at each anchor with a linear model using a feature pyramid representation.

Some authors started to introduce more sophisticated architectures. For example, the network introduced by Gidaris et al. [13] iteratively refines bounding boxes from object detection according to the box category. Rajaram et al. [40] proposed a refinement strategy by iteratively pooling features on previous box predictions from an image feature map. Roh and Lee [46] enhanced classification and localization performance of Faster R-CNN [44] by pooling the visual features from the detected boxes and performing an extra bounding box regression and classification step. Jiang et al. [23] proposed IoU-Net, which predicts the Intersection over Union (IoU) between a box and its ground truth. The IoU-Net is used within an optimization procedure for improving box precision by maximizing an IoU objective. Since the models in [13, 23, 40, 46] are designed for object/vehicle detection in which no natural language query is provided, their models do not leverage a query for the refinement. Since natural language queries are always present in grounding, we build a custom regression model that leverages such information to guide the refinement. Lastly, the approach of [46] is similar to ours, but our method supports an arbitrary number of refinement steps while theirs only supports one step.

Some methods tackle the regression problem by reformulating the loss. For example, Rezatofighi et al. [45] proposed a generalized IoU metric that can be used as a loss, enabling a direct optimization on the IoU performance. He et al. [19] introduced a box regression loss that reflects both bounding box transformation and localization variance. The variance can be used to merge nearby bounding boxes.

Chapter 3

Approach

In this chapter, we introduce our approach for improving bounding box precision. The main idea of our method is that we start from an initial image bounding box along with a language query and we refine the box to better align with the ground truth using the supplied query. This process requires reasoning about the visual content of the initial box and the query phrase itself.
While one can potentiallythink of an oracle process that refines a box in a single step, we posit that this maynot be practical in scenarios where the initial box is too far from the ground truthregion. As such, we propose an iterative refinement mechanism, that effectivelydecomposes the refinement task as a series of bounding box refinement steps. Thisprocess is conceptually akin to a visual search in which human eye undergoes witha series of fixations in an image when trying to locate an object of interest.In Section 3.1 we define notation and introduce our one-step OptiBox model(the one-step oracle process). We extend the simple model in Section 3.2 to in-clude an iterative refinement process. We call this process multi-step OptiBoxmodel, which includes the one-step model as a special case. To investigate how(multi-step) OptiBox can benefit the state-of-the-art, in Section 3.3 we incorporateOptiBox into two published models.10Input Box: rq512x5ReLUCNNPHRASERefinement ModuleLanguage EncoderImage Encoderq512 4xt̂LSTMRoIAlignIMGOutput Box: r’Figure 3.1: OptiBox: Our proposed bounding box refinement model. Givenan input bounding box, an image, and a phrase query, we encode theimage and query using an image encoder and a language encoder. Weextract visual features using RoIAlign according to the input box regionto obtain a feature vector x. The query features q are then concatenatedwith the visual feature vector for refinement. We project the concate-nated vector using feedforward layers. The final projection yields a4-dimensional bounding box adjustment. The numbers on the layersindicate the corresponding layer output dimensions.3.1 One-Step RefinementFigure 3.1 depicts our one-step refinement network. Given an image, a phrasequery, and a bounding box r = [rx,ry,rw,rh] ∈ R4 (the xy-coordinate of the top-left corner and the width and height of the box) corresponding to the query, wefirst obtain the image feature map using an image encoder. 
A visual vector x isthen extracted according to the input bounding box r using RoIAlign [18] pooling.For the language side, we encode the phrase query using a language encoder (forexample, an LSTM [20]) to obtain a language vector q. We concatenate x and qinto a single vector and project it to a 512-dimensional space for refinement. Therefinement process consists of five weight-sharing fully-connected layers of thesame size. Empirically, using more weight-shared fully-connected layers yields11OptiBoxMulti-Step OptiBoxr1 r2r’1 rk r’kr3r’2 r’k-1Current BoxGround-truth BoxOptiBox OptiBoxStep 1 Step 2 Step kFigure 3.2: Multi-step OptiBox. The one-step OptiBox model is applied it-eratively k times to adjust the bounding box. During the k steps of re-finement, an ideal multi-step model would gradually move the inputbounding box (shown in orange) towards the ground truth box (shownin blue).marginal improvements, and the choice of weight sharing is to avoid overfitting.Further, we project the output of the final refinement layer to a box adjustmentvector tˆ = [tˆx, tˆy, tˆw, tˆh] ∈ R4. All fully connected layers are followed by ReLU.Finally, we decode the adjustment vector tˆ to produce a new bounding box r′.To be consistent with existing methods, we define the relationship between theinput bounding box r, the adjustment prediction tˆ, and the output bounding box r′according to [15]. In details, during inference, the output box r′ can be obtained byr′x = rwtˆx+ rx r′y = rhtˆy+ ryr′w = rw exp(tˆw)r′h = rh exp(tˆh). (3.1)During training, given the input bounding box r = [rx,ry,rw,rh] ∈ R4 and theground truth box g = [gx,gy,gw,gh] ∈ R4, the targets of our bounding box regres-sion model are defined bytx = (gx− rx)/rw ty = (gy− ry)/rhtw = log(gw/rw) th = log(gh/rh) . (3.2)123.2 Multi-Step RefinementGiven an input query, a bounding box r, and an image, the one-step OptiBox modelreturns a new box r′, which can be computed according to Eq. 
3.1 using the predicted adjustment t̂ and the input box r. One could extend the one-step model by applying it iteratively to perform k steps of refinement. Figure 3.2 depicts the refinement process. Initially, the OptiBox model at step one takes in the input box r_1 (shown in orange) and outputs a refined box r′_1. At step two, the refined box becomes the input of the next iteration by setting the new input r_2 to the previous prediction r′_1, which leads to the new box r′_2. After k steps of the same process, the final output bounding box r′_k is returned. The blue box indicates the ground truth bounding box, and the figure shows the ideal case in which the multi-step process gradually moves the bounding box towards the ground truth. Note that this is a generalization of the one-step model. If we let k = 0, it corresponds to an identity function, which returns the input box of the multi-step model. If we let k = 1, it corresponds to the one-step OptiBox model in the previous section.

Loss Criterion: We experimented with different criteria, including the L1 loss, smooth L1 loss, and L2 loss (sorted by their validation performance in descending order). Since the L1 loss works the best, we use it to guide the training of both one-step and multi-step OptiBox. For a k-step OptiBox network (where k ≥ 1), we average the L1 loss across the k time steps by

\[ \ell_1 = \frac{1}{k} \sum_{j=1}^{k} \lVert \hat{t}_j - t_j \rVert_1 \tag{3.3} \]

where t̂_j and t_j are the prediction from OptiBox and the target adjustment at time step j. The target t_j can be computed from r_j (the input bounding box at time step j) and the ground truth box g using Eq. 3.2.

3.3 Integrating OptiBox into the State of the Art

The motivation of our approach is to improve bounding box precision for visual grounding by adapting the proposed refinement architecture to various grounding models. We have incorporated OptiBox into two published grounding models: GroundeR [47] and the One-Stage Grounding [57] model. Here we briefly introduce the two models. More details are provided in the original papers.

Figure 3.3: An illustration of the GroundeR [47] model. Weight-shared linear layers are in the same color.

3.3.1 GroundeR

The GroundeR model was introduced by Rohrbach et al. [47]. As illustrated in Fig. 3.3, an input image first goes through an object detector to obtain N bounding box proposals r_i = [r_x, r_y, r_w, r_h] of potential objects, where i ∈ {1, ..., N}. For each proposal, its visual features x_i are extracted from the detection backbone. The query phrase is then encoded using an LSTM [20] to obtain the last hidden state linguistic features h. To combine the two modalities, we first project x_i and h into a space of common dimensionality using two separate fully-connected layers with the ReLU activation. The projected query feature vector is then added to each of the projected proposal features in the common space to produce a feature vector z_i for each proposal. At this point, z_i should contain information from both the query and the object i. At inference time, we score the correspondence between the pairs by projecting z_i to a scalar α_i and assign the query to the proposal with the maximum α_i. During training, the cross-entropy loss is applied against the target box, which is the proposal with the highest IoU among all proposals, provided that its IoU with the ground truth is at least 0.5.
Since not all bounding boxes in the proposal stage are relevant to the query, to incorporate OptiBox into this model, we allow a one-step OptiBox to refine the output box of GroundeR (i.e., after the argmax operator).

Figure 3.4: An illustration of the One-Stage Grounding [57] model. The figure is taken from the original paper.

3.3.2 One-Stage Grounding

The One-Stage Grounding model [57] uses the input phrase to guide the object proposal stage. As illustrated in Fig. 3.4, they first obtain a language vector of the input phrase using a language encoder and a feature pyramid of the input image using a feature pyramid network [30]. Instead of using a standard region proposal network (RPN), they broadcast the language vector into each grid of the feature maps, which allows the model to directly propose query-related bounding boxes. Using a YOLO [41] architecture, they implicitly maintain 4,032 query-conditioned bounding boxes, whereas conventional grounding methods use about 100 proposals. This avoids the case in which none of the box proposals is close to the target object. To integrate with the OptiBox model, we refine the most confident box from the One-Stage Grounding model [57]. The result of the refinement is considered to be the result of the grounding process. To study the effects of having multiple refinement steps on a state-of-the-art model, we experiment with different variations of the multi-step OptiBox model in the following chapter.

Chapter 4
Experiments

In this chapter, we devise two types of experiments to evaluate our proposed OptiBox refinement model: a controlled experiment (Section 4.2) to showcase properties of OptiBox, and an experiment of incorporating OptiBox into published grounding models [47, 57] (Section 4.3).
In Section 4.1, we describe the Flickr30k Entities [38] dataset used for the experiments. Further, the limitations and failure cases of OptiBox are discussed in Section 4.4. Finally, we summarize the implementation details of the models in different experiments in Section 4.5.

4.1 Dataset

We use the Flickr30k Entities [38] dataset to train and evaluate our model. The Flickr30k Entities dataset consists of 31,783 images where each image comes with five captions. Noun phrases for each caption are extracted and each phrase is associated with a human-annotated bounding box. Following the convention of previous research, we use the data split provided by the authors [38] with 1,000 images for validation, 1,000 images for testing, and the rest for training.

Figure 4.1 shows an image with annotations from the dataset. There are five captions describing the image and there are five entities of interest. Each sentence has a reference to each entity indicated by its color. Each entity is associated with a ground truth bounding box. Note that although sentences are provided in Flickr30k Entities [38], we do not assume that phrases come with sentences; for each prediction, we only consider an individual phrase as input and output a bounding box.

Figure 4.1: One example from the Flickr30k Entities [38] dataset.

4.2 Denoising Experiment

In this section, we investigate the basic properties of the OptiBox model via a controlled experiment. Details of the experimental setup are given in Section 4.2.1. The experiment is used to answer questions such as how the number of refinement steps affects the performance of OptiBox (Section 4.2.2) and how sensitive OptiBox is to the choice of visual backbones and language models (Section 4.2.3). In Section 4.2.4, we take a closer look at individual predictions in terms of the IoU change between the bounding box and the ground truth before and after using OptiBox. Finally, the qualitative results are provided in Section 4.2.5.
4.2.1 Experiment Description

To investigate the properties of the OptiBox model, we devise a controlled denoising experiment. In this experiment, we first move the ground truth bounding box in the spatial dimensions by uniform noise in the range of [−α, α], where α is in pixels, while ensuring at least a part of the box is within the image. We crop the box so that only the part of the box inside the image is considered. The median height and width of the images are 375 pixels and 500 pixels respectively. The objective of OptiBox is to predict the original ground truth box from the noisy ground truth box. When α is smaller, the input boxes will be closer to the ground truth; when α is larger, the input boxes will be farther away from the ground truth. By controlling the noise level, we can examine OptiBox's response to different starting bounding box conditions. We use two metrics to evaluate the performance of OptiBox at different noise levels: visual grounding accuracy and Intersection over Union (IoU). Although visual grounding accuracy is the most commonly used metric in the literature, the motivation of OptiBox is to improve bounding box precision, and visual grounding accuracy alone might not be the most suitable metric to evaluate this aspect. Under the grounding accuracy metric, a prediction is considered to be positive as long as the predicted box has an IoU with the ground truth of at least 0.5, regardless of whether the IoU is much higher than 0.5. If a predicted bounding box has an IoU of 0.6 (greater than 0.5), even if OptiBox regresses the box to obtain an IoU of 0.8 (a box with higher precision), the accuracy will stay the same. Therefore, in addition to the conventional grounding accuracy metric, we inspect the exact changes in bounding box precision by keeping track of the changes in IoU for individual predictions.

4.2.2 Iterative Refinement

We study the behavior of OptiBox by varying the noise level.
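To make the setup of Section 4.2.1 concrete, the following is a minimal sketch of the jitter, the two metrics, and the regression targets of Eqs. 3.1 and 3.2, with boxes stored as [x, y, w, h]. The function names are hypothetical, and the way a jittered box is kept partly inside the image (resampling the shift until the cropped box is non-empty) is an assumption, since the text only states that part of the box must remain inside the image.

```python
import math
import random


def jitter_box(box, alpha, img_w, img_h, rng):
    """Shift a box by uniform noise in [-alpha, alpha], then crop to the image."""
    x, y, w, h = box
    while True:  # assumption: resample until part of the box is inside
        nx = x + rng.uniform(-alpha, alpha)
        ny = y + rng.uniform(-alpha, alpha)
        x1, y1 = max(nx, 0.0), max(ny, 0.0)
        x2, y2 = min(nx + w, img_w), min(ny + h, img_h)
        if x2 > x1 and y2 > y1:
            return [x1, y1, x2 - x1, y2 - y1]


def iou(a, b):
    """Intersection over Union of two [x, y, w, h] boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def grounding_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth is >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(pred_boxes)


def encode_targets(r, g):
    """Regression targets of Eq. 3.2 for input box r and ground truth g."""
    return [(g[0] - r[0]) / r[2], (g[1] - r[1]) / r[3],
            math.log(g[2] / r[2]), math.log(g[3] / r[3])]


def decode_box(r, t):
    """Apply an adjustment t to box r as in Eq. 3.1."""
    return [r[2] * t[0] + r[0], r[3] * t[1] + r[1],
            r[2] * math.exp(t[2]), r[3] * math.exp(t[3])]


rng = random.Random(0)
gt = [100.0, 100.0, 50.0, 80.0]
noisy = jitter_box(gt, alpha=80, img_w=500, img_h=375, rng=rng)
print(noisy, iou(noisy, gt), decode_box(noisy, encode_targets(noisy, gt)))
```

Decoding the exact targets recovers the ground truth box up to floating-point error, which is what a perfect one-step model would produce.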
As the noise level gets higher, more bounding boxes are moved farther away from the ground truth, which results in a lower accuracy to begin with. Table 4.1 shows the accuracy of OptiBox when trained to refine in both the one-step and the multi-step settings starting with boxes from different noise levels. We pick the three noise levels (40, 80, and 160 pixels) to represent the cases where the starting boxes have high, medium and low proximity to the ground truth. The step-0 accuracies are the accuracies of boxes after the random jitter on the ground truth boxes (no box refinement). The step-1 accuracies are the accuracies of boxes refined by a one-step OptiBox model. As an example, with uniform noise in the range [−80, 80] pixels, the starting accuracy is 46.88%, which is refined by a two-step OptiBox model to obtain an accuracy of 78.73%.

Num Steps    40 Pixels (%)    80 Pixels (%)    160 Pixels (%)
0            73.19            46.88            20.55
1            87.72            75.11            57.11
2            88.10            78.73            64.14
4            87.13            78.71            66.15
8            86.76            78.45            61.91

Table 4.1: Accuracy of one-step and multi-step OptiBox when trained in different noise levels and different number of refinement steps. Note that zero-step corresponds to the accuracy before doing refinement.

[Plot: grounding accuracy versus number of refinement steps (0 to 8), one curve per noise level: 10, 20, 40, 80, 160, and 320 px.]

Figure 4.2: Accuracy at each refinement step of an eight-step OptiBox model when trained in different noise levels.

The optimal number of steps differs according to different starting accuracies (or noise levels). For the lower noise levels (40 and 80 pixels), the optimal number of steps is two. For a higher noise level (160 pixels), the optimal number of steps is doubled. Thus, when the starting box is closer to the ground truth, the model prefers fewer refinement steps. When the starting box is further away, more steps are favored. This observation is consistent with our motivation of multi-step OptiBox as the model needs to perform a visual search on the image when the starting box is far from the target.
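The k-step procedure compared in Table 4.1 amounts to feeding each refined box back into the same one-step model, with the loss of Eq. 3.3 averaged over the steps. The sketch below illustrates this under stated assumptions: `model` is any callable that maps a box to a 4-dimensional adjustment, and the two stand-in models (`oracle` and `half_step`) are hypothetical functions that peek at the ground truth purely to show the mechanics; they are not the learned OptiBox network.

```python
import math


def encode_targets(r, g):
    """Targets of Eq. 3.2 for input box r and ground truth g ([x, y, w, h])."""
    return [(g[0] - r[0]) / r[2], (g[1] - r[1]) / r[3],
            math.log(g[2] / r[2]), math.log(g[3] / r[3])]


def decode_box(r, t):
    """Box update of Eq. 3.1."""
    return [r[2] * t[0] + r[0], r[3] * t[1] + r[1],
            r[2] * math.exp(t[2]), r[3] * math.exp(t[3])]


def refine(model, box, k):
    """Apply the one-step model k times; k = 0 returns the input unchanged."""
    boxes = [box]
    for _ in range(k):
        boxes.append(decode_box(boxes[-1], model(boxes[-1])))
    return boxes


def multistep_l1_loss(model, box, gt, k):
    """Average L1 distance between prediction and target over k steps (Eq. 3.3)."""
    total, r = 0.0, box
    for _ in range(k):
        t_hat = model(r)
        t = encode_targets(r, gt)  # target is recomputed from the current box
        total += sum(abs(a - b) for a, b in zip(t_hat, t))
        r = decode_box(r, t_hat)   # next input is the previous prediction
    return total / k


gt = [100.0, 100.0, 50.0, 80.0]
start = [60.0, 140.0, 80.0, 40.0]


def oracle(r):
    # Hypothetical perfect model: returns the exact target.
    return encode_targets(r, gt)


def half_step(r):
    # Hypothetical imperfect model: moves halfway towards the target.
    return [0.5 * v for v in encode_targets(r, gt)]


for b in refine(half_step, start, 3):
    print([round(v, 2) for v in b])
```

With the half-step stand-in, each iteration halves the remaining offset in x and y, which is one way to picture why several small corrections can succeed where a single large one is hard to predict.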
The table also shows that adding refinement steps beyond the optimum does not further increase the accuracy. We hypothesize that the model overfits as it becomes more expressive with an increasing number of steps. When comparing the training and test accuracies for noise level 80, for example, the 2-step and the 8-step models have training accuracies of 83.97% and 84.86% respectively. Therefore, the gaps between training and test accuracy for the 2-step model and the 8-step model are 5.24% (83.97% − 78.73%) and 6.41% (84.86% − 78.45%) respectively, which suggests that the 8-step model slightly overfits.

So far, we have seen the final accuracies of OptiBox when trained with a different number of steps. Now we examine the accuracy at individual refinement steps for a given model. Figure 4.2 shows the accuracy at each refinement step for an eight-step OptiBox model trained with different amounts of noise. We train the eight-step OptiBox model under a fixed noise level for 20 epochs using a VGG16 backbone [50] pre-trained on Visual Genome [27] and an LSTM [20] language model with GloVe embeddings [37]. The model with the highest validation accuracy is saved for evaluation. Across all noise levels, applying OptiBox iteratively always improves the initial accuracy and converges. The number of steps for convergence (maximum accuracy on the curve) is highlighted as a dot on each curve in the figure.

With a low noise level of 10, OptiBox improves the accuracy marginally as there is a limited gap for improvement. Importantly, this shows that the model chooses to stay near the ground truth when the starting box is already very accurate. As we increase the noise level, the positive effect of OptiBox becomes more significant. In particular, for the noise level of 160, the accuracy goes to 63.26%, which is a gain of 42.71 percentage points over the starting 20.55% accuracy.
When the starting accuracy is too low (for example, 6.35% with 320 pixels), the accuracy goes from 6.35% to 44.56%, which is a smaller gain (38.21 percentage points) compared to the 160 noise level. Additionally, the figure also shows that although OptiBox always improves the accuracy regardless of the noise level, the converging accuracy is limited by the starting accuracy. For example, starting with an accuracy of 6.35% (320 pixels) can only lead to an accuracy of 44.56%, while starting at a higher accuracy of 46.88% (80 pixels) the model converges to around 79%.

4.2.3 Visual and Language Backbones

Visual Backbone: Table 4.2 shows the effect of different visual backbones for the multi-step OptiBox model using an LSTM [20] language encoder with GloVe embeddings [37]. The model is trained in the same way as in the previous section, except for the choice of visual backbone, and the optimal number of refinement steps is used for each noise level. Across different noise levels, the accuracy of the OptiBox model using ResNet101 [16] pre-trained on ImageNet [28] is consistently higher than using the VGG16 [50] backbone pre-trained on the same dataset.

Backbone                40 px (2 Steps)    80 px (2 Steps)    160 px (4 Steps)
VGG16 (ImageNet)        75.71              58.21              44.05
VGG16 (VG)              88.10              78.73              66.15
ResNet101 (ImageNet)    77.04              61.84              47.81

Table 4.2: Accuracy of a multi-step OptiBox using different visual backbones in different noise level settings.

Comparing pre-training datasets for the same visual backbone (VGG16 [50]), pre-training on Visual Genome [27] yields a substantial gain over ImageNet [28]. This behavior is consistent with the ablation provided for the GroundeR model [47]. That experiment [47] shows that detection features are more useful than classification features for visual grounding. We hypothesize that the object detection task requires more focus on capturing spatial information than the image-level classification task, and bounding box regression relies on spatial knowledge.
Additionally, the choice of pre-training dataset and pre-training task has a higher impact on the accuracy than the choice of backbone architecture.

Language Backbone: Table 4.3 shows the impact of different phrase encoders on the performance of OptiBox. In this experiment, the more powerful BERT [8] model is compared against the LSTM [20] model with GloVe embeddings [37]. We use a multi-step OptiBox model with VGG16 [50] pre-trained on Visual Genome [27]. The table shows that having a more powerful language encoder can increase the model's grounding performance, but the improvement is marginal. This result is reasonable as the phrases from Flickr30k [38] are relatively short and simple, and thus do not require a very sophisticated language representation.

Language Encoder     40 px (2 Steps)    80 px (2 Steps)    160 px (4 Steps)
bert-base-uncased    88.34              79.09              66.14
LSTM (GloVe)         88.10              78.73              66.15

Table 4.3: Accuracy of a multi-step OptiBox using different language encoders in different noise level settings.

4.2.4 Changes in IoU Distribution

[Contour plots. Panels (a)-(c): step 1 at 10, 20, and 40 pixels. Panels (d)-(f): step 8 at 10, 20, and 40 pixels.]

Figure 4.3: Contour plot of IoU changes for lower noise levels.

Since the primary focus of OptiBox is on improving the bounding box precision using the natural language query, and grounding accuracy is not a suitable metric to reflect this property, we investigate the changes of Intersection over Union (IoU) for individual predictions. In particular, we use kernel density estimation to plot the joint probability distribution of two random variables: the IoU before and the IoU after the OptiBox refinement. The OptiBox we use is an eight-step refinement model with a VGG16 backbone [50] pre-trained on Visual Genome [27] and an LSTM [20] phrase encoder. The model is trained in different noise level settings. Figure 4.3 and Figure 4.4 show the changes in IoU distributions before and after the refinement in the first refinement step and the eighth refinement step.
The former figure shows the IoU changes under the influence of lower noise levels. The latter shows the changes under high noise levels.

As an example, Figure 4.3 (c) shows a two-dimensional distribution centered above the diagonal line. This distribution is desirable as it reflects higher output IoUs compared to the input in most cases. One can imagine that an ideal OptiBox model would always return a box with an IoU of 1.0 regardless of the input IoU. That would be a horizontal line at the top of the 2D distribution plot. If an OptiBox model is an identity function, we expect the distribution to lie on the diagonal of the plot (see Figure 4.3 (a)). Additionally, the marginal distributions of IoU before and after the refinement are located on the top and the right of the 2D distribution plot.

[Contour plots. Panels (a)-(c): step 1 at 80, 160, and 320 pixels. Panels (d)-(f): step 8 at 80, 160, and 320 pixels.]

Figure 4.4: Contour plot of IoU changes for higher noise levels.

There are a few common properties reflected in the two figures. To begin with, the spread of the 2D distribution at the 8th step is always larger than at the 1st refinement step. Further, the centers of the 2D distributions are located in the region above the diagonal line, which suggests improvements in bounding box precision across all noise levels. The improvement of bounding box precision is the most significant when the noise level is at 80 pixels and 160 pixels. If the noise level is too low, say 10 pixels, there is little room for improvement. In such a case, the network decides to stay at the diagonal line (see Figure 4.3 (a)). Comparing step 1 and step 8 in the lower noise level settings (see Figure 4.3), fewer steps are desirable to avoid spreading the distribution. If the noise levels are high (see Figure 4.4), more steps can bring the distribution further to the top of the plot.

4.2.5 Qualitative Analysis

Figure 4.5: Positive qualitative examples of OptiBox refinement.

Table 4.1 shows that different noise levels have a different optimal number of refinement steps.
The optimal number of steps for the 40 pixels, 80 pixels, and 160 pixels settings are two, two, and four respectively. We display some qualitative examples of the three noise level settings in Figure 4.5 using their optimal number of steps. The blue box is the ground truth bounding box. The phrase query is shown at the top-left corner of each image, where the individual words are separated by an underscore for display purposes. The starting and ending tokens are shown. Apart from the blue box, all other boxes are predictions from OptiBox. Earlier predictions are shown in darker colors while the more recent predictions are shown in lighter colors. As an example, for the phrase "lady" in Figure 4.5, OptiBox starts with the darkest box on the right and converges to the white box, which is more similar to the blue ground truth box than the starting box.

4.3 OptiBox with the State-of-the-Art Models

We have adapted OptiBox into two published visual grounding models (GroundeR [47] and One-Stage Grounding [57]) to examine the effect of OptiBox on state-of-the-art visual grounding models.

GroundeR + OptiBox: Given a phrase and an image, the GroundeR model uses a region proposal network (RPN) from a detection backbone to obtain object proposals. The object proposals are then scored against the input phrase using a linear layer. The proposal with the highest score is chosen to be the grounding result. To employ OptiBox in GroundeR, we start with the bounding box returned by GroundeR and refine the bounding box using OptiBox.

One-Stage + OptiBox: Compared to GroundeR, the One-Stage Grounding model does not rely on a generic region proposal network (RPN) for candidate proposals, as many of the proposals would be irrelevant to the query. Instead, they propose bounding boxes by conditioning on the query based on a YOLO framework. At the end of the grounding stage, a feature map is returned which encodes around 4,032 latent bounding boxes.
The bounding box with the highest confidence score is returned. Similar to GroundeR, the bounding box returned by the grounding model is refined using OptiBox.

Approach                    Visual Features    Finetune         Accuracy (%)
SCRC [22]                   VGG16              ImageNet         27.80
DSPE [54]                   VGG19              Pascal           43.89
GroundeR [47]               VGG16              Pascal           47.81
CCA [38]                    VGG19              Pascal           50.89
MCB + Reg + Spatial [5]     VGG16              Pascal           51.01
Similarity Net [55]         VGG19              Pascal           51.05
MNN + Reg + Spatial [5]     VGG16              Pascal           55.99
SeqGROUND [9]               VGG16              N/A              61.60
CITE-Resnet [39]            ResNet101          COCO             61.33
CITE-Pascal [39]            VGG16              Pascal           59.27
CITE-Flickr30K [39]         VGG16              F30K             61.89
QRC Net [6]                 VGG16              F30K             65.14
GraphGround [34]            VGG16              Visual Genome    63.87
GraphGround++ [34]          VGG16              Visual Genome    66.93
One-Stage [57]              Darknet53-FPN      COCO (F30K)      67.62
GroundeR                    ResNet101          Visual Genome    62.15
GroundeR + OptiBox          ResNet101          Visual Genome    65.20
One-Stage                   Darknet53-FPN      COCO (F30K)      66.12
One-Stage + OptiBox-1       Darknet53-FPN      COCO (F30K)      66.12
One-Stage + OptiBox-2       Darknet53-FPN      COCO (F30K)      66.17

Table 4.4: Performance comparison of the state-of-the-art models on the Flickr30k Entities dataset.

4.3.1 Grounding Accuracy Performance

Table 4.4 shows the accuracy of the two models in comparison to other published models. Our implementation of the GroundeR baseline model has an accuracy of 62.15%. With one step of OptiBox refinement, the accuracy increases to 65.20%. Note that our GroundeR implementation has significantly higher accuracy than the original model. The reason is that the original GroundeR model makes use of a selective search proposal mechanism as opposed to a Faster R-CNN [44] framework. Table 4.5 shows the upper bound accuracy of the two proposal mechanisms along with other state-of-the-art methods. The proposal upper bound is the ratio between the number of phrases that have at least one proposal with an IoU of more than 0.5 with the ground truth and the total number of phrases. Additionally, the choice of visual backbone and the pre-training dataset can have a significant impact on visual grounding accuracy.

Approach          Proposal UB (%)    Accuracy (%)
GroundeR [47]     77.90              47.81
RPN+QRN [6]       71.25              53.48
SS+QRN [6]        77.90              55.99
PGN+QRN [6]       89.61              60.21
One-Stage [57]    95.48              68.69
GroundeR          84.00              62.15

Table 4.5: Comparison of proposal upper bounds of various state-of-the-art models with our models. In the QRN network, RPN, SS and PGN are different proposal methods.

In terms of the One-Stage Grounding model, our baseline has an accuracy of 66.12%, which is similar to the accuracy of 67.62% of the original model (see Table 4.4). Using a one-step OptiBox model (trained separately) to refine the output of the One-Stage Grounding model does not change the accuracy of the baseline (see One-Stage + OptiBox-1). With a two-step OptiBox model (trained separately), there is a marginal accuracy increase.

4.3.2 IoU Distribution Analysis

[Two density plots (x-axis: Intersection over Union (IoU); y-axis: probability density). Top: GroundeR and One-Stage. Bottom: GT20, GT40, and GT80.]

Figure 4.6: A comparison of different IoU distributions.

Figure 4.7: Contour plot of IoU changes before and after the refinement by a one-step OptiBox model for the GroundeR [47] model and the One-Stage Grounding [57] model. The percentages of median IoU improvement (non-zero IoUs) for the two models are 10.83% and 0.15%.

Figure 4.6 shows the IoU distributions of the two published models before OptiBox refinement compared against the IoU distributions in the denoising experiment in Section 4.2.2. The top part of the figure shows that by conditioning on phrases, the One-Stage model has an IoU distribution centered in the higher IoU regions, in contrast to the generic RPN from GroundeR. The IoU distribution of the GroundeR model lies in between the ones from the denoising experiment with 40 pixels and 80 pixels of uniform noise.
Figure 4.4 shows that OptiBox works exceptionally well under the 80 pixel noise level. This explains the more significant increase in visual grounding accuracy for GroundeR compared to the One-Stage Grounding model.

Figure 4.7 shows the IoU distribution (the IoU between the output boxes and the ground truth) of the baseline models and their one-step OptiBox counterparts. For the GroundeR [47] model, the distribution is centered in the triangle on the left, which suggests a significant improvement in bounding box precision. The percentage of median IoU improvement is 10.83%. In particular, we divide the deltas of IoUs before and after the refinement by the starting IoUs. Since division by zero is not defined, the percentage is only based on predictions that have non-zero starting IoUs. In contrast, the distribution of the One-Stage Grounding [57] model lies on the diagonal line and its percentage of median IoU improvement is 0.15%. Figure 4.7 and Table 4.4 show that the improvement of bounding box precision and grounding accuracy for the GroundeR model is more significant compared to the One-Stage Grounding model. There are two reasons. First, the GroundeR model has 100 bounding box proposals and not all proposals are related to the query, while the One-Stage Grounding model implicitly maintains 4,032 query-related boxes through a YOLO formulation. When the most confident box is returned for the two models, it is more likely for the One-Stage Grounding model to have a predicted box that is very close to the ground truth. Second, the GroundeR model does not perform bounding box regression at the end, while the One-Stage Grounding model has a linear regression layer for box refinement through convolution.
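As an illustration of the median IoU improvement statistic quoted above, the following sketch computes the relative IoU change per prediction and reports its median as a percentage, skipping predictions whose starting IoU is zero. The function name is hypothetical, and taking the median of per-prediction ratios (rather than, say, a ratio of medians) is an assumption about how the statistic is aggregated.

```python
from statistics import median


def median_relative_iou_gain(iou_before, iou_after):
    """Median of (after - before) / before, in percent, over non-zero starts."""
    gains = [(after - before) / before
             for before, after in zip(iou_before, iou_after)
             if before > 0]  # division by zero is undefined, so skip these
    if not gains:
        raise ValueError("no prediction has a non-zero starting IoU")
    return 100.0 * median(gains)


# Toy values only; these are not results from the thesis.
before = [0.5, 0.0, 0.4, 0.8]
after = [0.6, 0.3, 0.6, 0.8]
print(median_relative_iou_gain(before, after))
```

Because the second toy prediction starts at an IoU of zero, it is excluded even though refinement improved it, which mirrors the caveat stated in the text.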
Due to the large quantity of bounding box proposals in the One-Stage Grounding model, a linear layer is sufficient for the box refinement task.

4.4 Discussion and Limitations

Figure 4.8: Three main cases where OptiBox fails.

Although OptiBox shows consistent bounding box precision improvement in many cases, it still has certain limitations. Here we showcase, in a qualitative manner, the scenarios in which OptiBox tends to fail. In Figure 4.8, we categorize three major cases where OptiBox fails. First of all, when there are multiple similar entities, OptiBox tends to predict a box (the white box) that is the average region of all entities as opposed to a more precise prediction. For example, given the phrase "a black dog" when two dogs are present, OptiBox predicts a box containing both dogs. Further, since the grounding takes in only individual phrases from a sentence, the lack of contextual information leads to out-of-context results. For example, the phrase "couple" in the original sentence actually refers to the couple sitting across from each other and does not include the woman in the background. Lastly, sometimes the predicted boxes could be tighter around the objects.

4.5 Implementation Details

OptiBox: For the controlled experiments, we train OptiBox to regress to the ground truth. We use the Adam optimizer [26] with a learning rate of 1e-3 and no weight decay to train the network for 20 epochs with a batch size of 64 queries. To encourage stability of the training process, we first train OptiBox to perform a single-step refinement in the first two epochs. Afterward, we allow the network to perform iterative refinement using the supervision provided by Eq. 3.3. To provide a good initialization of the language model, we first train an LSTM [20] autoencoder to reconstruct the training phrases and sentences from Flickr30k [38] until convergence. The trained encoder weights are then used by OptiBox in the initialization process.
The autoencoder with the lowest validation loss is chosen. As to the word embeddings, the 200-dimensional GloVe embeddings [37] from the Twitter corpus are used. Similarly, we pick the OptiBox with the highest validation accuracy for the controlled experiments. As to the dimensions of the various components, the LSTM encoder provides a 512-dimensional hidden vector to capture the semantics of each phrase. In a later experiment, we use the BERT encoder [8], which has a 768-dimensional hidden vector. In the experiments studying the impact of different backbones (VGG16 [50] and ResNet101 [16]) in the OptiBox framework, we extract the feature map before the adaptive average pooling and the classification heads for both backbones.

GroundeR + OptiBox: We adapt the ResNet101 network [16] pre-trained on Visual Genome [27] with the top 200 most frequent object class labels. To be consistent with most existing approaches, we do not finetune the detection backbone on Flickr30k [38]. We allow the region proposal network to generate 50 proposals for each image. The bounding box visual features x_i are also batch-normalized, followed by a projection to R^128. The resulting vector is summed with the projected hidden states from the query phrase. The aggregated features then go through a fully connected layer that forms the attention over the proposed bounding boxes, and the attention values are penalized against the target using cross-entropy. We train our grounding model for 25 epochs with weight decay 0.0005. We use the Adam optimizer [26] with batch size 128 and a scheduled learning rate (that decays from 0.001 by a factor of 1/10 at epochs 15 and 25), and we select the model maximizing validation accuracy for reporting on the test set.

Since the performance of the bounding box regression is conditioned on the outputs of the grounding model, we begin training a single-step OptiBox model when the grounding model converges.
Once the regression network starts training, the GroundeR model is set to inference mode. To train the regression model, we use Adam [26] with a learning rate of 0.0001 and a batch size of 128 until convergence. Again, the L1 loss works the best in our validation.

One-Stage + OptiBox: For the One-Stage Grounding model [57], we closely follow the training procedure for the LSTM [20] version of the model in the original paper. Table 4.4 indicates that we can reproduce results close to those reported in the original paper. The OptiBox model is trained jointly with the One-Stage Grounding model. However, since the OptiBox model is conditioned on the output of the One-Stage Grounding model [57], we do not update OptiBox until the One-Stage Grounding model [57] provides reasonable results. Thus, we start updating OptiBox at epoch four. To encourage the stability of the optimization process, for a multi-step OptiBox model, we train the OptiBox model on a one-step refinement starting at epoch four and enable multi-step refinement at epoch eight.

Chapter 5
Conclusion

Visual grounding is the problem of locating an entity within an image using a natural language query. Most recent methods in the literature are limited by the precision of bounding boxes (the overlap between a box and the ground truth annotation) from a region proposal network. To this end, we introduce a query-guided bounding box refinement network (OptiBox) that adjusts the output of a grounding system so that the output boxes are tight around the target objects. Our experiments show that the proposed model can improve bounding box precision significantly when the boxes are away from the ground truth.
We adapt the OptiBox model to the GroundeR model [47] and the One-Stage Grounding model [57] and show that our model can improve grounding accuracy for a two-stage grounding system while maintaining the accuracy of a one-stage grounding system.

Chapter 6
Future Work

Our experiments have shown that OptiBox can improve the overlap between bounding boxes and their ground truth, and that it can increase the grounding accuracy of a two-stage grounding system. There are three directions for extending the OptiBox model in future work. First, the OptiBox model refines the box of each phrase in isolation. In some applications, phrases come from sentences. We could extend the OptiBox model for such applications by leveraging the context within a sentence to facilitate more effective refinement. One way to leverage such context is to make the refinements of related phrases more correlated. For example, given the sentence “a woman is wearing a white jacket and a man is wearing a blue jacket.”, the ground truth box of “a woman” should be closer to the box of “a white jacket” than to that of “a blue jacket.” Second, OptiBox only looks at one patch of the image at a time. This formulation is problematic when the initial bounding box position is too far away from the target. One way to address this is to add some global context of the image, for example, either by effectively utilizing a global feature map of the image during the search or by incorporating a memory module to encode the history of the search. Finally, the current OptiBox design requires the model to perform a fixed number of refinement steps. We could avoid excessive iterations when the input box at a time step already has a high overlap with the ground truth. The most straightforward way to implement this would be to add a classification head to the model so that it decides whether to refine before performing an extra refinement step.

Bibliography

[1] P. Anderson, X. He, C. Buehler, D.
Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
[3] Y. Balaji, M. R. Min, B. Bai, R. Chellappa, and H. P. Graf. Conditional GAN with discriminative filter generation for text-to-video synthesis. In IJCAI. AAAI Press, 2019.
[4] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome. MUTAN: Multimodal Tucker fusion for visual question answering. In ICCV, 2017.
[5] K. Chen, R. Kovvuri, J. Gao, and R. Nevatia. MSRC: Multimodal spatial regression with semantic context for phrase grounding. In ICMR. ACM, 2017.
[6] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In ICCV, 2017.
[7] Y.-C. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT, 2018.
[9] P. Dogan, L. Sigal, and M. Gross. Neural sequential phrase grounding (SeqGROUND). CVPR, 2019.
[10] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. EMNLP, 2016.
[11] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen. Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia, 2017.
[12] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, and H. Li. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In CVPR, 2019.
[13] S. Gidaris and N. Komodakis.
Object detection via a multi-region and semantic segmentation-aware CNN model. In ICCV, 2015.
[14] R. Girshick. Fast R-CNN. In ICCV, 2015.
[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[18] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. TPAMI, 2020.
[19] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang. Bounding box regression with uncertainty for accurate object detection. In CVPR, 2019.
[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[21] H. Hotelling. Relations between two sets of variates. In Breakthroughs in Statistics. Springer, 1992.
[22] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, 2016.
[23] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, 2018.
[24] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[25] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2014.
[27] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[29] Y. Li, M. R. Min, D. Shen, D. Carlson, and L. Carin. Video generation from text. In AAAI, 2018.
[30] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[31] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
[32] J. Lu, D. Batra, D. Parikh, and S. Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
[33] J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee. 12-in-1: Multi-task vision and language representation learning. arXiv preprint arXiv:1912.02315, 2019.
[34] L. W. M. Bajaj and L. Sigal. GraphGround: Graph-based language grounding. In ICCV, 2019.
[35] A. Mogadala, M. Kalimuthu, and D. Klakow. Trends in integration of vision and language research: A survey of tasks, datasets, and methods. arXiv preprint arXiv:1907.09358, 2019.
[36] J. Mun, L. Yang, Z. Ren, N. Xu, and B. Han. Streamlined dense video captioning. In CVPR, 2019.
[37] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[38] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[39] B. A. Plummer, P. Kordas, M. Hadi Kiapour, S. Zheng, R. Piramuthu, and S. Lazebnik. Conditional image-text embedding networks. In ECCV, 2018.
[40] R. N. Rajaram, E. Ohn-Bar, and M. M. Trivedi. RefineNet: Iterative refinement for accurate object localization. In ITSC, 2016.
[41] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[42] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[43] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. ICML, 2016.
[44] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[45] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.
[46] M.-C. Roh and J.-y. Lee. Refining Faster-RCNN for accurate object detection. In MVA, 2017.
[47] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, 2016.
[48] A. Sadhu, K. Chen, and R. Nevatia. Zero-shot grounding of objects from natural language queries. In ICCV, 2019.
[49] P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[50] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
[51] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. VL-BERT: Pre-training of generic visual-linguistic representations. ICLR, 2020.
[52] D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. In CVPR, 2017.
[53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.
[54] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
[55] L. Wang, Y. Li, J. Huang, and S. Lazebnik.
Learning two-branch neural networks for image-text matching tasks. TPAMI, 2018.
[56] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
[57] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. A fast and accurate one-stage approach to visual grounding. In ICCV, 2019.
[58] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.
[59] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi. From recognition to cognition: Visual commonsense reasoning. In CVPR, 2019.
[60] Z. Zhang, Y. Xie, and L. Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In CVPR, 2018.
[61] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In CVPR, 2016.

