@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Science, Faculty of"@en, "Computer Science, Department of"@en ; edm:dataProvider "DSpace"@en ; ns0:degreeCampus "UBCV"@en ; dcterms:creator "Ghotbi, Borna"@en ; dcterms:issued "2019-12-16T19:06:08Z"@en, "2019"@en ; vivo:relatedDegree "Master of Science - MSc"@en ; ns0:degreeGrantor "University of British Columbia"@en ; dcterms:description """In this work, we address the problem of food ingredient detection from meal images, which is an intermediate step for generating cooking instructions. Although image-based object detection is a familiar task in computer vision and has been studied extensively in the last decades, the existing models are not suitable for detecting food ingredients. Normally objects in an image are explicit, but ingredients in food photos are most often invisible (integrated) and hence need to be inferred in a much more contextual manner. To this end, we explore an end-to-end neural framework with the core property of learning the relationships between ingredient pairs. We incorporate a Transformer module followed by a Gated Graph Attention Network (GGAT) to determine the ingredient list for the input dish image. This framework encodes ingredients in a contextual yet order-less manner. Furthermore, we validate our design choices through a series of ablation studies and demonstrate state-of-the-art performance on the Recipe1M dataset."""@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/72773?expand=metadata"@en ; skos:note "Graph-based Food Ingredient DetectionbyBorna GhotbiB.Sc., Sharif University of Technology, 2017A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Computer Science)The University of British Columbia(Vancouver)December 2019c© Borna Ghotbi, 2019The following individuals certify that they have read, and recommend to the Faculty of Graduate andPostdoctoral Studies for acceptance, the thesis entitled:Graph-based Food Ingredient Detectionsubmitted by Borna Ghotbi in partial fulfillment of the requirements for the degree of Master of Sci-ence in Computer Science.Examining Committee:Leonid Sigal, Computer ScienceSupervisorJames J. Little, Computer ScienceSupervisory Committee MemberiiAbstractIn this work, we address the problem of food ingredient detection from meal images, which is an inter-mediate step for generating cooking instructions. Although image-based object detection is a familiartask in computer vision and has been studied extensively in the last decades, the existing models arenot suitable for detecting food ingredients. Normally objects in an image are explicit, but ingredientsin food photos are most often invisible (integrated) and hence need to be inferred in a much more con-textual manner. To this end, we explore an end-to-end neural framework with the core property oflearning the relationships between ingredient pairs. We incorporate a Transformer module followed bya Gated Graph Attention Network (GGAT) to determine the ingredient list for the input dish image.This framework encodes ingredients in a contextual yet order-less manner. 
Furthermore, we validate our design choices through a series of ablation studies and demonstrate state-of-the-art performance on the Recipe1M dataset.

Lay Summary

Food computation studies support a variety of applications and services, such as guiding human behavior, improving human health, and understanding culinary culture. Building on advances in artificial intelligence in the last few years, we identify ingredients from meal images. This task is more challenging than other object detection tasks because the ingredients are most often invisible in the food photo. In this thesis, we propose a model which incorporates graphs to capture the relationships between ingredient pairs and enhance ingredient predictions. We evaluate different design choices qualitatively and quantitatively on the Recipe1M dataset.

Preface

The entire work presented here is original work done by the author, Borna Ghotbi, performed under the supervision of Professor Leonid Sigal.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
1.1 Overview
1.2 Problem Definition
1.3 Method Outline
1.4 Data
1.5 Outline
2 Background & Related Work
2.1 Multi-Label Classification
2.2 Sequence to Sequence Models
2.2.1 Transformer
2.3 Graph-Based Models
2.3.1 Graph Neural Networks
2.3.2 Graph Convolution Network
2.3.3 Graph Attention Network
2.3.4 Gated Graph Neural Network
2.4 Food Data Computation
3 Approach
3.1 Model
3.1.1 Image Encoder
3.1.2 Ingredient Decoder
3.2 Training
4 Experiments
4.1 Experimental Setup
4.2 Results and Comparison
4.2.1 Repeating Ingredients
4.2.2 Graphs Comparison
4.2.3 Masking
4.2.4 Feed-Forward Networks and Auto-Regressive Networks
4.3 Qualitative Results
5 Conclusion & Future Work
Bibliography

List of Tables

Table 4.1 A comparison between different graph architectures over the specified settings. The Setting columns represent the number of layers, the use of a gating function, the independence of layer weights, and the use of the attention mechanism, respectively.
Table 4.2 A comparison for masking and selecting top ingredients.
Table 4.3 Model accuracy comparison.

List of Figures

Figure 1.1 Model Overview.
Figure 2.1 (Left) Transformer architecture. (Right) Multi-head attention is composed of attention layers.
Figure 3.1 The model architecture.
Figure 3.2 A Transformer block: each block receives the output of the previous block and the image encoding, passes the inputs through its layers, and generates the next output.
Figure 4.1 The blue line does not perform any masking. The green line performs masking with a -inf value. The pink line performs masking with the Min value.
Figure 4.2 Masking before the graph network.
Figure 4.3 Masking after the graph network and prioritizing with a Maximum Spanning Tree.
Figure 4.4 Qualitative results. Green-colored ingredients are matching ingredients between the prediction and the ground truth. Red denotes the absence of a predicted ingredient in the ground truth. Black indicates the absence of a ground-truth ingredient in the prediction set.

Acknowledgments

Firstly, I would like to express my sincere gratitude to my advisor Prof. Leonid Sigal for the continuous support of my Master's study and related research, for his patience, motivation, and immense knowledge. His guidance helped me throughout the research and the writing of this thesis.
Besides my advisor, I would like to thank Prof.
Jim Little, as the second reviewer of my thesis fortaking his time out and providing me with his insightful comments.I would also thank the department of Computer Science of University of British Columbia (UBC)for providing a great learning platform and supporting me financially as a teaching assistant. A bigthanks to all my labmates for being such great colleagues and being kind and helpful. I am also thankfulto Vector Institute for AI for providing me the financial and infrastructure support.Last but not the least, I would like to thank my family: my parents and to my sister for supportingme spiritually throughout writing this thesis and my life in general with their unconditional love.xChapter 1Introduction*A ship in harbor is safe, but that is not what ships are built for. — William G.T. Shedd1.1 OverviewFood plays a significant role in our everyday activities. We receive the nutrients from food, and theenergy allows us to stimulate growth, perform activities, work, and learn. Furthermore, healthy foodplays an important role in absorbing essential nutrients and disease prevention.Traditionally, the sources for food-related studies were limited to small-scale data such as cook-books, recipes, and questionnaires. Food perception [40], food consumption [33], food culture [18], andfood safety [6] are some of the different aspects of food analysis back in the days.With the advancements in technology and the emergence of social media, food photography has be-come a popular activity among people. These advancements have also led to an abundance of cookingwebsites and mobile applications. Following the number of resources on the Web, larger food datasetshave been introduced. In addition to food images, these new datasets provide more comprehensiveinformation regarding cooking recipes, which are composed of ingredients and necessary cooking in-structions. New research ideas and directions are shaped by taking advantage of these big-scale cookingdatasets. In the field of Computer Vision and Machine Learning, researchers are mainly concerned withgenerating suitable representations for cooking images, ingredients, and the relative cooking instruc-tions to put them together in a way that enables meaningful transition between representations. Thesetransitions are commonly used for ingredients and cooking instructions recognition and retrieval fromfood images in a multi-modal space.Classification, or recognition, tasks predict ingredients or instructions from an input image. Thisis different from retrieval, where the end goal is to search and retrieve the most similar ingredientsor instructions available in the dataset from an input image. Retrieval tasks also study the oppositeproblem where instructions are retrieved from the images. Considering how recent this field of study is,deep learning methods are the favorite choice of the researchers in the stated approaches.1Although image classification and object detection are traditional problems in Computer Vision andhave been studied vastly in the last decades, the proposed models are not ideally suitable for recognizingdishes and detecting ingredients. Normally objects in an image are explicit and detectable, like pedes-trians in a crowded street or birds flying in a blue sky. 
By contrast, food photos usually do not provide a distinctive spatial layout or discernible appearance for the ingredients, e.g., an olive inside a mixed salad. Furthermore, other conditions such as point of view or lighting make the problem difficult.
On the bright side, the color and texture of the dish can help in inferring a specific type of food. These attributes can help food recognition regardless of lighting and other variations. As a result, the model should be able to capture local information for this specific type of classification problem.

1.2 Problem Definition

As discussed earlier, food ingredient prediction is a challenging problem in the cooking context, where ingredients are not explicitly obvious in a food image. Furthermore, these ingredients are not necessarily independent, and there are relationships between the elements; e.g., beef and onion often occur in the same recipe as they are cooked together for better taste, and salt and pepper are common add-ons to a recipe of meat or fish. Therefore, we need an architecture capable of modeling dependencies between the ingredients. Besides, when a chef is asked for a cooking recipe, he or she would typically provide a list of ingredients as a first step. The word list suggests sequential ordering, but in fact the items in the list do not follow any particular order. An ordering could be imposed on the ingredients, e.g., an alphabetical ordering, but such an ordering is a matter of convenience for the publisher and not inherent in the recipe itself. In other words, the distance between ingredients on the alphabetized list is not a good proxy for how often such ingredients occur together or the order in which they should be combined. As a result, designing a model capable of predicting the ingredients in an order-less but contextual manner would intuitively be helpful.

1.3 Method Outline

We propose an encoder-decoder architecture with the property of learning the relationships between ingredient pairs, where the ordering of the ingredients is not an important parameter. We incorporate a Transformer [1] module followed by a Gated Graph Attention Network (GGAT) for our decoder. The graph-based module helps the Transformer to tune the parameters and reason further over the prediction output. We also find the use of an attention mechanism essential to weight the edge importance between the graph nodes.

1.4 Data

We validate our model with the Recipe1M dataset provided by [4]. Recipe1M contains 1,029,720 recipes, collected in two stages. Initially, recipes and related images are extracted from a couple of cooking websites. Then, these recipes are augmented with additional cooking images from the Web using an image search engine. Each recipe should include at least one image to be valuable for the learning process. As a result, we use the cleaned-up version provided by [37]. In this version, recipes must have at least a single image and include more than two ingredients. Applying this filter provides us with 252,547 training, 54,255 validation, and 54,506 test samples.

Figure 1.1: Model Overview.

The recipes provide ingredients and instructions. The number of unique ingredients across all recipes is 1488, and the maximum number of ingredients per recipe is 20. Instructions are further parsed into phrases, and ingredients are tokenized into a set of words, which results in 23,231 unique words. For experimenting with the model proposed in this thesis, we only use the cooking images and ingredients.
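For concreteness, the cleanup filter described above can be sketched in a few lines of Python. The recipe field names ('images', 'ingredients') and the JSON layout are illustrative assumptions, not the actual Recipe1M schema.

import json

MIN_INGREDIENTS = 3   # recipes must include more than two ingredients
MAX_INGREDIENTS = 20  # maximum number of ingredients per recipe reported above

def keep_recipe(recipe):
    # Keep a recipe only if it has at least one image and an ingredient
    # count in the allowed range (the filter described above).
    has_image = len(recipe.get('images', [])) >= 1
    n_ingredients = len(recipe.get('ingredients', []))
    return has_image and MIN_INGREDIENTS <= n_ingredients <= MAX_INGREDIENTS

def load_filtered_split(path):
    # path points to one split (train / validation / test) stored as JSON.
    with open(path) as f:
        recipes = json.load(f)
    return [r for r in recipes if keep_recipe(r)]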
The cooking instructions can be used for instruction generation methods in the future.1.5 OutlineThe thesis is organized as follows: In Chapter 2, we review the related works and provide backgroundknowledge for sequence to sequence models and graph-based networks. Furthermore, we give anoverview of the food-based computing domain. Chapter 3, provides information regarding the archi-tecture design of our model. We discuss the details of the information flow and the training process.Chapter 4 includes our experimental results for different model settings along with the details of thedataset. We also present some qualitative results at the end of this section. Finally, in Chapter 5 wesummarize our main contributions and discuss possible future directions.3Chapter 2Background & Related WorkThe task of ingredient prediction from images lies in the domain of multi-modal learning, a rich fieldof study in Computer Vision and Deep Learning. Our emphasis is on the intersection of language andvision and therefore, this section points out relevant architectures in order to solve the problems in thesedomains. On top of that, we review the recent approaches for food analysis.2.1 Multi-Label ClassificationClassification tasks are among the popular problems in Computer Vision. In contrast to single-labelclassification where a single target label is assigned to each input sample instance, multi-label classifi-cation refers to more than one class assignment for each input instance. Multi-label classification canaddress problems such as text-categorization, medical diagnosis, map labeling, etc. Despite being lesspopular in general, there are several methods that address multi-label classification such as [22, 47]. Inimage classification models, VGG [39] and ResNet [19] can be used as pre-trained models to extractimage features. These feature representations can later be used as input for the task of single-label ormulti-label classification.One way to address image labeling problems is to use sequence to sequence (seq2seq) [7] archi-tectures. In general, this architecture design is suitable for generating variable-length output data fromother variable-length input data. Depending on the problem, these lengths can be different from anotheror even fixed. For example, in traditional multi-label image classification the input is a fixed vector,while the output could be a variable vector of positive labels. Translation is a common applicationfor sequence to sequence models. In these tasks sentences of one language are translated into anotherlanguage.In a setting where order is not important between the prediction outputs, sequence to sequencemodels and Recurrent Neural Networks (RNNs) [29] are not an entirely suitable model as they assumean ordering and process data sequentially. One would ideally prefer a representation that deals withvariable sized sets instead. As a result, category-wise max-pooling was suggested as an addition toRNN outputs [44]. Set cardinality is the other topic related to multi-class prediction. Only recently, thenumber of set members is considered as a parameter to learn in addition to the multi-label classification4learning [27, 45]. Before this, models predicted the top k labels, and k was fixed during the process [5,14].Our model is inspired by the following architecture designs. We use a Transformer to (sequentially)generate initial ingredient predictions from the image. 
These outputs are then used by the (order-less)graph-based model to further refine the predictions, using a form of set reasoning implemented withneural message passing on this graph.2.2 Sequence to Sequence ModelsLong-Short-Term-Memory (LSTM) models [21] are one of the standard architecture choices for seq2seqmachine translations. LSTMs are a special kind of RNN, which are capable of remembering the im-portant parts in a sequence, and their architecture design helps to solve RNNs problems with vanishinggradients.Seq2seq models mainly consist of an encoder and a decoder. The encoder module is capable oftransforming input data into a higher-dimensional space embedding using, typically, a neural net; and adecoder transforms back encoder’s output into a sequence. This sequence can be a reconstruction of theinput, a translation, or a newly generated sequence of outputs.2.2.1 TransformerBahdanau et al. [1] proposed a novel seq2seq model called Transformers based on an attention mech-anism. The attention mechanism is a way to detect important parts of a sequence in each time-step.When an encoder reads the input at a specific time-step, the attention will look at the whole input andgenerate weight for each input token, based on its estimated importance. On the other side, the decoderwill receive the weights in addition to encoder’s output.As opposed to adding attention on top of other models like RNNs, [42] suggests stacking attentionunits combined with feed-forward layers. This architecture incorporates an attention mechanism as areplacement of RNNs to construct an entire model framework. The architecture is depicted in Figure 2.1.Among different architectures for Transformers, Multi-Headed Attention [42] is a common choiceof architecture. It has a key-value structure, a query, and a memory. A query searches for keys ofall words with a relevant context. In the case of Multi-Headed Attention, we can have several query-key-value pairs when a word provides multiple meanings and connects to several values that encodethe meaning of a keyword. Moreover, the relationship between queries to keys and keys to values islearnable. Hence, the model can change the connection between each search word and the relevantwords providing context.Later on, [12] introduced a new language representation model based on Transformers called BERT,which proves an improvement in Natural Language tasks due to the use of Transformers and withoutany RNNs.5Figure 2.1: (Left) Transformer Architecture. (Right) Mult-Head attention is composed of attentionlayers.2.3 Graph-Based ModelsGraphs can be utilized instead of forms of RNNs for problems where reasoning is done over sets or moregeneral structures not easily expressed as sequences. The general word graph refers to a set of vertices(nodes) connected through edges. The objects in a specific problem can be represented as nodes, andtheir relationships among each other can be modeled as edges. Each node can be connected to anothernode through multiple types of edges, and these edges can be directed or not.The expressive power of graphs has led to solving different problems such as node classification,link prediction, and clustering. The most typical use of graphs is in node classification, where each noderepresents a feature representation, and the task would be to predict each node’s label. Link predictionproblems aim to find whether two nodes are connected or not [28]. 
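As an aside, before turning to the graph-based models, the attention computation at the heart of the Transformer of Section 2.2.1 can be illustrated with a minimal single-head scaled dot-product sketch in PyTorch; the tensor shapes and the use of a single head are simplifications for exposition, not the exact architecture used later.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, sequence_length, d_k) query, key, and value tensors.
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        # Positions where mask == 0 are excluded from the attention.
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)  # importance of each key for each query
    return torch.matmul(weights, v), weights

Multi-Headed Attention runs several such maps in parallel with separate learned projections and concatenates the results.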
The objective in clustering is to find a disjoint partition of nodes where the nodes inside a cluster have stronger connections compared to the ones across other clusters [41].
With the rise of deep learning models [25], many works suggested generalizing Convolutional Neural Networks (CNNs) to structured graphs [3, 11, 20, 26], and this can be the first motivation to use graph-based neural networks. As stated in [48], the key properties of CNNs solve the problems we usually face in graphs. They also provide local connectivity, which is a crucial property for graphs. The other key feature of CNNs is the use of shared weights, which is absent in traditional graph-based methods like [10]. In the following sub-sections, we give an overview of graph-based neural networks.

2.3.1 Graph Neural Networks

The capability of Graph Neural Networks (GNNs) to learn representations for graph nodes and edges is referred to as learning graph embeddings [16]. In traditional methods, these learning steps were substituted with defining hand-engineered features, which are neither accurate nor efficient. DeepWalk [34], as the first algorithm using an embedding representation, applied the SkipGram [30] model to learn an embedding of each node based on previously generated random walks in an unsupervised manner. The main drawback of this method is its high computation cost, due to the linearly growing number of parameters per node as a result of not sharing parameters in the encoder.
Most GNNs use CNNs and graph embeddings to summarize the information flowing inside the graph structure. This property allows sharing parameters and adding dependency inside the model. Another significant advantage of GNNs is order invariance. In other words, the output of GNNs is invariant to the input order of nodes, and a graph function G satisfies the condition G(P^T A P) = G(A), where A stands for the adjacency matrix and P is an arbitrary permutation matrix. In other words, GNNs are equivariant with respect to permutations.
CNNs and LSTMs stack the inputs in a sequential manner, which implies some ordering. GNNs are also a well-suited structure to represent the flow of information. The dependency between the nodes can be represented through the edge message-passing process. In standard neural networks, dependency information counts as a feature of the nodes, while GNNs can model this dependency as part of their graph structure and perform message propagation guided by these dependencies.
Applying neural networks to a graph was first introduced in [15] and then [38] in the form of an RNN. As described in [48], in a node classification problem, each node v is represented by its feature vector x_v and is mapped to the target label t_v. The goal is to predict a d-dimensional vector h_v, which provides enough information for the node to be classified as its ground-truth label. h_v can be represented as follows:

h_v = f(x_v, x_{co[v]}, h_{ne[v]}, x_{ne[v]})    (2.1)

where x_{co[v]}, h_{ne[v]}, and x_{ne[v]} are representations of the features of the edges, the states, and the features of the nodes in the neighborhood of v, respectively. The function f is responsible for mapping the input data to a lower, d-dimensional space. A summary of the model's message passing at time-step t is:

H^{t+1} = f(H^t, X)    (2.2)

where H and X denote the concatenation of all the h and x, respectively.
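To make the update of Eq. (2.2) concrete, a toy dense-adjacency sketch is given below; the particular choice of f (an MLP over averaged neighbour states concatenated with node features) is an assumption for illustration only, not the formulation of [48].

import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    # One synchronous step H_{t+1} = f(H_t, X) over a dense adjacency matrix.
    def __init__(self, feat_dim, state_dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(state_dim + feat_dim, state_dim), nn.ReLU())

    def forward(self, H, X, A):
        # H: (n, state_dim) node states, X: (n, feat_dim) node features,
        # A: (n, n) adjacency matrix with 1 where an edge exists.
        degree = A.sum(dim=1, keepdim=True).clamp(min=1)
        neighbour_states = (A @ H) / degree  # average state of each node's neighbours
        return self.f(torch.cat([neighbour_states, X], dim=-1))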
To compute the output o_v of the GNN, a fully-connected neural network g acts as the output function applied to h_v:

o_v = g(h_v)    (2.3)

The final step is to define the loss function and optimize the parameters via gradient descent:

loss = \sum_{i=1}^{N} (t_i - o_i)    (2.4)

where N is the number of nodes and t_i is the target information on node i.
With the initial success of GNNs on different applications, a variety of new graph-based models were proposed. In the following sections we briefly review three of these variants.

2.3.2 Graph Convolution Network

The Graph Convolution Network (GCN) [24] initially proposed a first-order propagation model consisting of only one layer of convolution. This simplified version has a lower possibility of over-fitting on local neighborhood structures. As proposed by [24], the re-normalization trick is helpful to prevent vanishing gradients in the model. In this model, the graph convolution generates the normalized sum of the node features of the neighbors:

h_i^{(l+1)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)} \Big)    (2.5)

where \mathcal{N}(i) is the set of neighbors of node i (its incoming edges), c_{ij} is a normalization constant, and \sigma is the activation function (usually ReLU in GCNs).

2.3.3 Graph Attention Network

A GCN architecture is capable of node classification based on a local node-to-node message-passing process. Still, generalization may not happen in some cases because of the structure-dependent characteristic of the model. To address this problem, [17] suggests averaging over a node's neighbor features. Later, [43] suggests weighting the neighbor features and changing the way the aggregating function works. The node embedding of layer t+1 is computed from the previous layer as follows:

z_i^{(t)} = W^{(t)} h_i^{(t)}
e_{ij}^{(t)} = \mathrm{LeakyReLU}\big( a^{(t)\top} ( z_i^{(t)} \| z_j^{(t)} ) \big)
\alpha_{ij}^{(t)} = \frac{\exp(e_{ij}^{(t)})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik}^{(t)})}
h_i^{(t+1)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(t)} z_j^{(t)} \Big)    (2.6)

In this series of equations, a form of attention called additive attention is used. Note that \| denotes concatenation.

2.3.4 Gated Graph Neural Network

Following the work of [8] on Gated Recurrent Units (GRUs), [26] introduced the Gated Graph Neural Network (GG-NN), which is formulated as:

z_v^{(t)} = \sigma\big( W_z \cdot [h_v^{(t-1)}, a_v^{(t)}] \big)
r_v^{(t)} = \sigma\big( W_r \cdot [h_v^{(t-1)}, a_v^{(t)}] \big)
\tilde{h}_v^{(t)} = \tanh\big( W \cdot [r_v^{(t)} * h_v^{(t-1)}, a_v^{(t)}] \big)
h_v^{(t)} = (1 - z_v^{(t)}) \cdot h_v^{(t-1)} + z_v^{(t)} \cdot \tilde{h}_v^{(t)}    (2.7)

where a_v^{(t)} is the aggregated message received from the neighborhood nodes at the t-th iteration. The GG-NN update functions, based on GRUs, help the model generate information about the current node conditioned on the information of the other nodes and the previous steps. Using GRUs enables the model to have long-term propagation of information across the graph structure.
As mentioned before, our model is inspired by these approaches, and it is designed to work in the domain of food data analysis. In the following section we describe the work done in this domain.

2.4 Food Data Computation

Recent food datasets, such as Food-101 [2] and Recipe1M [36], provide information regarding the ingredients in a cooking recipe, the cooking instructions, and the corresponding images. These datasets catalyzed new research ideas. The work done on food datasets is mostly based on neural networks, as they have dominated the fields of Computer Vision and Natural Language Processing (NLP) in recent years.
A recent way to predict ingredients, cutting methods, and cooking methods is to use CNNs. For instance, [4] uses this information to do recipe retrieval. In this case, the additional information from cutting and cooking methods helps the performance of ingredient prediction.
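Looking back at the gated update of Eq. (2.7), the following bare-bones sketch writes the GRU-style node update out explicitly; the linear layers over concatenated inputs and the absence of per-edge weights are simplifications for illustration, not the exact GG-NN or GGAT implementation.

import torch
import torch.nn as nn

class GatedNodeUpdate(nn.Module):
    # GRU-style state update of Eq. (2.7), applied to every node at once.
    def __init__(self, dim):
        super().__init__()
        self.W_z = nn.Linear(2 * dim, dim)  # update gate
        self.W_r = nn.Linear(2 * dim, dim)  # reset gate
        self.W_h = nn.Linear(2 * dim, dim)  # candidate state

    def forward(self, h, a):
        # h: (n, dim) previous node states, a: (n, dim) aggregated neighbour messages.
        z = torch.sigmoid(self.W_z(torch.cat([h, a], dim=-1)))
        r = torch.sigmoid(self.W_r(torch.cat([h, a], dim=-1)))
        h_tilde = torch.tanh(self.W_h(torch.cat([r * h, a], dim=-1)))
        return (1 - z) * h + z * h_tilde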
Another approach is to learn a joint representation of images and ingredients using Deep Belief Networks (DBNs) [31]. A key feature of this work is the separation of visible and non-visible ingredients in modeling the problem, which results in an improvement in performance. Salvador et al. [36] bring up the idea of a shared space representation for image and text. They design a joint network architecture that finds the closest text data (cooking instructions and ingredients) to the cooking image by ranking the similarity between the generated embeddings. This architecture can retrieve cooking recipes from pictures, and the opposite is also possible.
Finally, [37] proposed the idea of generating recipes instead of retrieving them. They first generate ingredients using an image encoder pre-trained on ImageNet and a Transformer decoder. Then, they produce the final cooking instructions conditioned on the cooking images and the predicted ingredients using another Transformer decoder. Their model can generate recipes which are not included in the dataset, and using an attention mechanism in the form of Transformer models is one of the critical changes in their architecture compared to their previous work [36]. They also showed that using this new recipe model for a retrieval task can improve the accuracy of their earlier retrieval results.

Chapter 3
Approach

3.1 Model

Image-to-set prediction is a challenging task in the food computing context, where ingredients come in different colors and textures and are generally occluded or mixed in a cooked dish. In these problems, the dataset includes image and label-set pairs, and the objective is to learn a function that accurately predicts the set of labels given the image. In our domain, the images are only of food dishes, and the tags are ingredients. The set of labels is a variable-sized collection of unique items with no ordering.
Our model, represented in Figure 3.1, is a multimodal encoder-decoder framework which is able to capture relationships between label items and use this information to predict a set of labels conditioned on the input image. Our dataset consists of N image and label pairs \{(I^{(i)}, L^{(i)})\}_{i=1}^{N}, where L is a set of labels chosen from the label dictionary D = \{d_i\}_{i=1}^{M} of size M, mapped to the image I. Note that the size of L can be any number between 0 and K, where K is the maximum number of labels. L can also be encoded as a matrix of size K \times M, and the goal is to predict \hat{L} by maximizing the following likelihood:

\arg\max_{\theta_I, \theta_T, \theta_G} \sum_{i=1}^{N} \log p\big(\hat{L}^{(i)} = L^{(i)} \mid x^{(i)}; \theta_I, \theta_T, \theta_G\big)    (3.1)

where \theta_I, \theta_T, and \theta_G are the learning parameters of the image encoder, the Transformer, and the graph network, respectively. The framework consists of a CNN-based image encoder that is responsible for extracting visual features from the input image as an encoding vector. This vector aims to encapsulate the information of all input elements to help the decoder make accurate predictions. The decoder module consists of a Transformer and a GNN. Initially, the Transformer decodes the encoder output by taking an auto-regressive approach. The Transformer is composed of Transformer blocks, where each block is conditioned on the image encoding and the previously predicted ingredients. Later, each node of the GNN is initialized with the feature vectors generated by the Transformer blocks, and we use graph networks to capture relationships between ingredients more explicitly.
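As a rough illustration of the graph stage, the sketch below shows a single-head, GAT-style attention aggregation over ingredient nodes in the spirit of Eq. (2.6); the dense pairwise scoring and single head are simplifications, and this is not the exact GGAT module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    # Single-head additive attention over node pairs, as in Eq. (2.6).
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h, adj):
        # h: (n, in_dim) node states, one node per candidate ingredient.
        # adj: (n, n) 0/1 adjacency; a fully connected graph lets every
        # ingredient attend to every other ingredient.
        z = self.W(h)                                   # (n, out_dim)
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.a(pairs).squeeze(-1))   # (n, n) edge scores
        scores = scores.masked_fill(adj == 0, float('-inf'))
        alpha = torch.softmax(scores, dim=-1)              # attention over neighbours
        return torch.relu(alpha @ z)                       # updated node states

A gated variant in the spirit of Section 2.3.4 would combine this attention-weighted aggregation with a GRU-style node update.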
Note that we use the attention mechanism in the GNN to focus on relevant pairs of ingredient elements.

Figure 3.1: The model architecture.

The final state of each node of the graph is then fed to a fully-connected prediction network followed by a Softmax function to make a binary decision on whether an ingredient should be grounded to the input image. The Softmax accepts the output of the fully-connected network and generates a normalized probability distribution. We pick the highest probabilities at each step to make predictions.
It is important to note that keeping the ingredient outputs generated by the graph in an unordered representation requires pooling over the graph nodes. During training we use max-pooling to transform the Softmax output matrix of size K \times M into a tensor of length M. As mentioned before, M represents the number of ingredient classes. This output is later used for calculating the Binary Cross-Entropy (BCE) loss, which is our learning objective. Binary Cross-Entropy measures how far the prediction is from the true (binary) value for each of the classes and then averages these class-wise errors to obtain the final loss.
We leverage the formulation and components from [37] for image encoding (Section 3.1.1) and Transformer decoding (Section 3.1.2.1). Our main technical contribution is adding the GNN in the ingredient decoding described in Section 3.1.2.2. The model overview is depicted in Figure 1.1.
In the following sections, we dive into the modules used in our model design.

3.1.1 Image Encoder

We use ResNet-50 [19], initialized with weights pre-trained on ImageNet [35], for extracting image features. This model is composed of convolutional layers, and the number 50 refers to the number of layers. An advantage of this model is its use of skip-connections, which help prevent vanishing gradients, a problem that tends to affect deeper networks. The encoder transforms an input image I \in \mathbb{R}^{a \times b \times 3} into an embedding vector of dimension 512, where a and b are the width and height of the image. To do so, we remove the top fully-connected layer from ResNet and extract the a' \times b' \times 2048 tensor from the previous layers (a' and b' are the spatial dimensions of the feature map). Passing this tensor to a fully-connected layer with an output size of 512 transforms the extracted features to our desired output size, ready to be fed to the Transformer module.

Figure 3.2: A Transformer block: each block receives the output of the previous block and the image encoding, passes the inputs through its layers, and generates the next output.

3.1.2 Ingredient Decoder

As discussed before, the decoder is composed of two major components: a Transformer and a GNN. In the following sections each of these modules is described.

3.1.2.1 Transformer

Inspired by [1], the Transformer module acts as an auto-regressive architecture for the image-to-set prediction task. The Transformer conditions on an image to output a product of conditional probabilities, where each probability depends on the previously generated outputs and the input image. This architecture accounts for dependencies between labels by sequential prediction of ingredients:

p(\hat{L}^{(i)}_k \mid x^{(i)}, L^{(i)