@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Science, Faculty of"@en, "Computer Science, Department of"@en ; edm:dataProvider "DSpace"@en ; ns0:degreeCampus "UBCV"@en ; dcterms:creator "Ghotbi, Borna"@en ; dcterms:issued "2019-12-16T19:06:08Z"@en, "2019"@en ; vivo:relatedDegree "Master of Science - MSc"@en ; ns0:degreeGrantor "University of British Columbia"@en ; dcterms:description """In this work, we address the problem of food ingredient detection from meal images, which is an intermediate step for generating cooking instructions. Although image-based object detection is a familiar task in computer vision and has been studied extensively in the last decades, the existing models are not suitable for detecting food ingredients. Normally objects in an image are explicit, but ingredients in food photos are most often invisible (integrated) and hence need to be inferred in a much more contextual manner. To this end, we explore an end-to-end neural framework with the core property of learning the relationships between ingredient pairs. We incorporate a Transformer module followed by a Gated Graph Attention Network (GGAT) to determine the ingredient list for the input dish image. This framework encodes ingredients in a contextual yet order-less manner. Furthermore, we validate our design choices through a series of ablation studies and demonstrate state-of-the-art performance on the Recipe1M dataset."""@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/72773?expand=metadata"@en ; skos:note "Graph-based Food Ingredient DetectionbyBorna GhotbiB.Sc., Sharif University of Technology, 2017A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Computer Science)The University of British Columbia(Vancouver)December 2019c© Borna Ghotbi, 2019The following individuals certify that they have read, and recommend to the Faculty of Graduate andPostdoctoral Studies for acceptance, the thesis entitled:Graph-based Food Ingredient Detectionsubmitted by Borna Ghotbi in partial fulfillment of the requirements for the degree of Master of Sci-ence in Computer Science.Examining Committee:Leonid Sigal, Computer ScienceSupervisorJames J. Little, Computer ScienceSupervisory Committee MemberiiAbstractIn this work, we address the problem of food ingredient detection from meal images, which is an inter-mediate step for generating cooking instructions. Although image-based object detection is a familiartask in computer vision and has been studied extensively in the last decades, the existing models arenot suitable for detecting food ingredients. Normally objects in an image are explicit, but ingredientsin food photos are most often invisible (integrated) and hence need to be inferred in a much more con-textual manner. To this end, we explore an end-to-end neural framework with the core property oflearning the relationships between ingredient pairs. We incorporate a Transformer module followed bya Gated Graph Attention Network (GGAT) to determine the ingredient list for the input dish image.This framework encodes ingredients in a contextual yet order-less manner. 
Furthermore, we validate our design choices through a series of ablation studies and demonstrate state-of-the-art performance on the Recipe1M dataset.

Lay Summary

Food computation studies support a variety of applications and services, such as guiding human behavior, improving human health, and understanding culinary culture. Building on advances in artificial intelligence in the last few years, we identify ingredients from meal images. This task is more challenging than other object detection tasks because the ingredients are most often invisible in the food photo. In this thesis, we propose a model which incorporates graphs to capture the relationships between ingredient pairs and enhance ingredient predictions. We evaluate different design choices qualitatively and quantitatively on the Recipe1M dataset.

Preface

The entire work presented here is original work done by the author, Borna Ghotbi, performed under the supervision of Professor Leonid Sigal.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
1.1 Overview
1.2 Problem Definition
1.3 Method Outline
1.4 Data
1.5 Outline
2 Background & Related Work
2.1 Multi-Label Classification
2.2 Sequence to Sequence Models
2.2.1 Transformer
2.3 Graph-Based Models
2.3.1 Graph Neural Networks
2.3.2 Graph Convolution Network
2.3.3 Graph Attention Network
2.3.4 Gated Graph Neural Network
2.4 Food Data Computation
3 Approach
3.1 Model
3.1.1 Image Encoder
3.1.2 Ingredient Decoder
3.2 Training
4 Experiments
4.1 Experimental Setup
4.2 Results and Comparison
4.2.1 Repeating Ingredients
4.2.2 Graphs Comparison
4.2.3 Masking
4.2.4 Feed-Forward Networks and Auto-Regressive Networks
4.3 Qualitative Results
5 Conclusion & Future Work
Bibliography

List of Tables

Table 4.1 A comparison between different graph architectures over the specified settings. The Setting columns represent the number of layers, the use of a gating function, the independence of layer weights, and the use of the attention mechanism, respectively.
Table 4.2 A comparison for masking and selecting top ingredients.
Table 4.3 Model accuracy comparison.

List of Figures

Figure 1.1 Model Overview.
Figure 2.1 (Left) Transformer architecture. (Right) Multi-head attention is composed of attention layers.
Figure 3.1 The model architecture.
Figure 3.2 A Transformer block: each block receives the output of the previous block and the image encoding, passes the inputs through its layers, and generates the next output.
Figure 4.1 The blue line does not perform any masking. The green line performs masking with a -inf value. The pink line performs masking with the Min value.
Figure 4.2 Masking before the graph network.
Figure 4.3 Masking after the graph network and prioritizing with a Maximum Spanning Tree.
Figure 4.4 Qualitative results. Green-colored ingredients are matching ingredients between the prediction and the ground truth. Red denotes the absence of a predicted ingredient in the ground truth. Black indicates the absence of a ground-truth ingredient in the prediction set.

Acknowledgments

Firstly, I would like to express my sincere gratitude to my advisor Prof. Leonid Sigal for the continuous support of my Master's study and related research, for his patience, motivation, and immense knowledge. His guidance helped me throughout the research and the writing of this thesis.
Besides my advisor, I would like to thank Prof.
Jim Little, as the second reviewer of my thesis fortaking his time out and providing me with his insightful comments.I would also thank the department of Computer Science of University of British Columbia (UBC)for providing a great learning platform and supporting me financially as a teaching assistant. A bigthanks to all my labmates for being such great colleagues and being kind and helpful. I am also thankfulto Vector Institute for AI for providing me the financial and infrastructure support.Last but not the least, I would like to thank my family: my parents and to my sister for supportingme spiritually throughout writing this thesis and my life in general with their unconditional love.xChapter 1Introduction*A ship in harbor is safe, but that is not what ships are built for. — William G.T. Shedd1.1 OverviewFood plays a significant role in our everyday activities. We receive the nutrients from food, and theenergy allows us to stimulate growth, perform activities, work, and learn. Furthermore, healthy foodplays an important role in absorbing essential nutrients and disease prevention.Traditionally, the sources for food-related studies were limited to small-scale data such as cook-books, recipes, and questionnaires. Food perception [40], food consumption [33], food culture [18], andfood safety [6] are some of the different aspects of food analysis back in the days.With the advancements in technology and the emergence of social media, food photography has be-come a popular activity among people. These advancements have also led to an abundance of cookingwebsites and mobile applications. Following the number of resources on the Web, larger food datasetshave been introduced. In addition to food images, these new datasets provide more comprehensiveinformation regarding cooking recipes, which are composed of ingredients and necessary cooking in-structions. New research ideas and directions are shaped by taking advantage of these big-scale cookingdatasets. In the field of Computer Vision and Machine Learning, researchers are mainly concerned withgenerating suitable representations for cooking images, ingredients, and the relative cooking instruc-tions to put them together in a way that enables meaningful transition between representations. Thesetransitions are commonly used for ingredients and cooking instructions recognition and retrieval fromfood images in a multi-modal space.Classification, or recognition, tasks predict ingredients or instructions from an input image. Thisis different from retrieval, where the end goal is to search and retrieve the most similar ingredientsor instructions available in the dataset from an input image. Retrieval tasks also study the oppositeproblem where instructions are retrieved from the images. Considering how recent this field of study is,deep learning methods are the favorite choice of the researchers in the stated approaches.1Although image classification and object detection are traditional problems in Computer Vision andhave been studied vastly in the last decades, the proposed models are not ideally suitable for recognizingdishes and detecting ingredients. Normally objects in an image are explicit and detectable, like pedes-trians in a crowded street or birds flying in a blue sky. 
By contrast, food photos usually do not provide a distinctive spatial layout or discernible appearance for the ingredients, e.g., an olive inside a mixed salad. Furthermore, other conditions such as point of view or lighting make the problem difficult.
On the bright side, the color and texture of the dish can help in inferring a specific type of food. These attributes can help food recognition regardless of lighting and other variations. As a result, the model should be able to capture local information for this specific type of classification problem.

1.2 Problem Definition

As discussed earlier, food ingredient prediction is a challenging problem in the cooking context, where ingredients are not explicitly obvious in a food image. Furthermore, these ingredients are not necessarily independent, and there are relationships between the elements; e.g., beef and onion often occur in the same recipe as they are cooked together for better taste, and salt and pepper are common add-ons to a recipe of meat or fish. Therefore, we need an architecture capable of modeling dependencies between the ingredients. Besides, when a chef is asked for a cooking recipe, he or she would typically provide a list of ingredients as a first step. The word list suggests sequential ordering, but in fact the items in the list do not follow any particular order. An ordering could be imposed on the ingredients, e.g., an alphabetical ordering, but such an ordering is a matter of convenience for the publisher and not inherent in the recipe itself. In other words, the distance between ingredients on the alphabetized list is not a good proxy for how often such ingredients occur together or the order in which they should be combined. As a result, designing a model capable of predicting the ingredients in an order-less but contextual manner would intuitively be helpful.

1.3 Method Outline

We propose an encoder-decoder architecture with the property of learning the relationships between ingredient pairs, where the ordering of the ingredients is not an important parameter. We incorporate a Transformer [1] module followed by a Gated Graph Attention Network (GGAT) for our decoder. The graph-based module helps the Transformer to tune the parameters and reason further over the prediction output. We also find the use of an attention mechanism essential to weight the edge importance between the graph nodes.

1.4 Data

We validate our model with the Recipe1M dataset provided by [4]. Recipe1M contains 1,029,720 recipes, collected in two stages. Initially, recipes and related images are extracted from a couple of cooking websites. Then, these recipes are augmented with additional cooking images from the Web using an image search engine. Each recipe should include at least one image to be valuable for the learning process. As a result, we use the cleaned-up version provided by [37]. In this version, recipes must have at least a single image and include more than two ingredients. Applying this filter provides us with 252,547 training, 54,255 validation, and 54,506 test samples.

Figure 1.1: Model Overview.

The recipes provide ingredients and instructions. The number of unique ingredients across all recipes is 1488, and the maximum number of ingredients per recipe is 20. Instructions are further parsed into phrases, and ingredients are tokenized into a set of words, which results in 23,231 unique words. For experimenting with the model proposed in this thesis, we only use the cooking images and ingredients.
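For concreteness, the cleanup filter described above can be sketched in a few lines of Python. The recipe field names ('images', 'ingredients') and the JSON layout are illustrative assumptions, not the actual Recipe1M schema.

import json

MIN_INGREDIENTS = 3   # recipes must include more than two ingredients
MAX_INGREDIENTS = 20  # maximum number of ingredients per recipe reported above

def keep_recipe(recipe):
    # Keep a recipe only if it has at least one image and an ingredient
    # count in the allowed range (the filter described above).
    has_image = len(recipe.get('images', [])) >= 1
    n_ingredients = len(recipe.get('ingredients', []))
    return has_image and MIN_INGREDIENTS <= n_ingredients <= MAX_INGREDIENTS

def load_filtered_split(path):
    # path points to one split (train / validation / test) stored as JSON.
    with open(path) as f:
        recipes = json.load(f)
    return [r for r in recipes if keep_recipe(r)]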
The cooking instructions can be used for instruction generation methods in the future.1.5 OutlineThe thesis is organized as follows: In Chapter 2, we review the related works and provide backgroundknowledge for sequence to sequence models and graph-based networks. Furthermore, we give anoverview of the food-based computing domain. Chapter 3, provides information regarding the archi-tecture design of our model. We discuss the details of the information flow and the training process.Chapter 4 includes our experimental results for different model settings along with the details of thedataset. We also present some qualitative results at the end of this section. Finally, in Chapter 5 wesummarize our main contributions and discuss possible future directions.3Chapter 2Background & Related WorkThe task of ingredient prediction from images lies in the domain of multi-modal learning, a rich fieldof study in Computer Vision and Deep Learning. Our emphasis is on the intersection of language andvision and therefore, this section points out relevant architectures in order to solve the problems in thesedomains. On top of that, we review the recent approaches for food analysis.2.1 Multi-Label ClassificationClassification tasks are among the popular problems in Computer Vision. In contrast to single-labelclassification where a single target label is assigned to each input sample instance, multi-label classifi-cation refers to more than one class assignment for each input instance. Multi-label classification canaddress problems such as text-categorization, medical diagnosis, map labeling, etc. Despite being lesspopular in general, there are several methods that address multi-label classification such as [22, 47]. Inimage classification models, VGG [39] and ResNet [19] can be used as pre-trained models to extractimage features. These feature representations can later be used as input for the task of single-label ormulti-label classification.One way to address image labeling problems is to use sequence to sequence (seq2seq) [7] archi-tectures. In general, this architecture design is suitable for generating variable-length output data fromother variable-length input data. Depending on the problem, these lengths can be different from anotheror even fixed. For example, in traditional multi-label image classification the input is a fixed vector,while the output could be a variable vector of positive labels. Translation is a common applicationfor sequence to sequence models. In these tasks sentences of one language are translated into anotherlanguage.In a setting where order is not important between the prediction outputs, sequence to sequencemodels and Recurrent Neural Networks (RNNs) [29] are not an entirely suitable model as they assumean ordering and process data sequentially. One would ideally prefer a representation that deals withvariable sized sets instead. As a result, category-wise max-pooling was suggested as an addition toRNN outputs [44]. Set cardinality is the other topic related to multi-class prediction. Only recently, thenumber of set members is considered as a parameter to learn in addition to the multi-label classification4learning [27, 45]. Before this, models predicted the top k labels, and k was fixed during the process [5,14].Our model is inspired by the following architecture designs. We use a Transformer to (sequentially)generate initial ingredient predictions from the image. 
These outputs are then used by the (order-less)graph-based model to further refine the predictions, using a form of set reasoning implemented withneural message passing on this graph.2.2 Sequence to Sequence ModelsLong-Short-Term-Memory (LSTM) models [21] are one of the standard architecture choices for seq2seqmachine translations. LSTMs are a special kind of RNN, which are capable of remembering the im-portant parts in a sequence, and their architecture design helps to solve RNNs problems with vanishinggradients.Seq2seq models mainly consist of an encoder and a decoder. The encoder module is capable oftransforming input data into a higher-dimensional space embedding using, typically, a neural net; and adecoder transforms back encoder’s output into a sequence. This sequence can be a reconstruction of theinput, a translation, or a newly generated sequence of outputs.2.2.1 TransformerBahdanau et al. [1] proposed a novel seq2seq model called Transformers based on an attention mech-anism. The attention mechanism is a way to detect important parts of a sequence in each time-step.When an encoder reads the input at a specific time-step, the attention will look at the whole input andgenerate weight for each input token, based on its estimated importance. On the other side, the decoderwill receive the weights in addition to encoder’s output.As opposed to adding attention on top of other models like RNNs, [42] suggests stacking attentionunits combined with feed-forward layers. This architecture incorporates an attention mechanism as areplacement of RNNs to construct an entire model framework. The architecture is depicted in Figure 2.1.Among different architectures for Transformers, Multi-Headed Attention [42] is a common choiceof architecture. It has a key-value structure, a query, and a memory. A query searches for keys ofall words with a relevant context. In the case of Multi-Headed Attention, we can have several query-key-value pairs when a word provides multiple meanings and connects to several values that encodethe meaning of a keyword. Moreover, the relationship between queries to keys and keys to values islearnable. Hence, the model can change the connection between each search word and the relevantwords providing context.Later on, [12] introduced a new language representation model based on Transformers called BERT,which proves an improvement in Natural Language tasks due to the use of Transformers and withoutany RNNs.5Figure 2.1: (Left) Transformer Architecture. (Right) Mult-Head attention is composed of attentionlayers.2.3 Graph-Based ModelsGraphs can be utilized instead of forms of RNNs for problems where reasoning is done over sets or moregeneral structures not easily expressed as sequences. The general word graph refers to a set of vertices(nodes) connected through edges. The objects in a specific problem can be represented as nodes, andtheir relationships among each other can be modeled as edges. Each node can be connected to anothernode through multiple types of edges, and these edges can be directed or not.The expressive power of graphs has led to solving different problems such as node classification,link prediction, and clustering. The most typical use of graphs is in node classification, where each noderepresents a feature representation, and the task would be to predict each node’s label. Link predictionproblems aim to find whether two nodes are connected or not [28]. 
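As an aside, before turning to the graph-based models, the attention computation at the heart of the Transformer of Section 2.2.1 can be illustrated with a minimal single-head scaled dot-product sketch in PyTorch; the tensor shapes and the use of a single head are simplifications for exposition, not the exact architecture used later.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, sequence_length, d_k) query, key, and value tensors.
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        # Positions where mask == 0 are excluded from the attention.
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)  # importance of each key for each query
    return torch.matmul(weights, v), weights

Multi-Headed Attention runs several such maps in parallel with separate learned projections and concatenates the results.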
The objective in clustering is to find a disjoint partition of nodes where the nodes inside a cluster have stronger connections compared to the ones across other clusters [41].
With the rise of deep learning models [25], many works suggested generalizing Convolutional Neural Networks (CNNs) to structured graphs [3, 11, 20, 26], and this can be the first motivation to use graph-based neural networks. As stated in [48], the key properties of CNNs solve the problems we usually face in graphs. They also provide local connectivity, which is a crucial property for graphs. The other key feature of CNNs is the use of shared weights, which is absent in traditional graph-based methods like [10]. In the following sub-sections, we give an overview of graph-based neural networks.

2.3.1 Graph Neural Networks

The capability of Graph Neural Networks (GNNs) to learn representations for graph nodes and edges is referred to as learning graph embeddings [16]. In traditional methods, these learning steps were substituted with defining hand-engineered features, which are neither accurate nor efficient. DeepWalk [34], as the first algorithm using an embedding representation, applied the SkipGram [30] model to learn an embedding of each node based on previously generated random walks in an unsupervised manner. The main drawback of this method is its high computation cost, due to the linearly growing number of parameters per node as a result of not sharing parameters in the encoder.
Most GNNs use CNNs and graph embeddings to summarize the information flowing inside the graph structure. This property allows sharing parameters and adding dependency inside the model. Another significant advantage of GNNs is order invariance. In other words, the output of GNNs is invariant to the input order of nodes, and a graph function G satisfies the condition G(P^T A P) = G(A), where A stands for the adjacency matrix and P is an arbitrary permutation matrix. In other words, GNNs are equivariant with respect to permutations.
CNNs and LSTMs stack the inputs in a sequential manner, which implies some ordering. GNNs are also a well-suited structure to represent the flow of information. The dependency between the nodes can be represented through the edge message-passing process. In standard neural networks, dependency information counts as a feature of the nodes, while GNNs can model this dependency as part of their graph structure and perform message propagation guided by these dependencies.
Applying neural networks to a graph was first introduced in [15] and then [38] in the form of an RNN. As described in [48], in a node classification problem, each node v is represented by its feature vector x_v and is mapped to the target label t_v. The goal is to predict a d-dimensional vector h_v, which provides enough information for the node to be classified as its ground-truth label. h_v can be represented as follows:

h_v = f(x_v, x_{co[v]}, h_{ne[v]}, x_{ne[v]})    (2.1)

where x_{co[v]}, h_{ne[v]}, and x_{ne[v]} are representations of the features of the edges, the states, and the features of the nodes in the neighborhood of v, respectively. The function f is responsible for mapping the input data to a lower, d-dimensional space. A summary of the model's message passing at time-step t is:

H^{t+1} = f(H^t, X)    (2.2)

where H and X denote the concatenation of all the h and x, respectively.
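To make the update of Eq. (2.2) concrete, a toy dense-adjacency sketch is given below; the particular choice of f (an MLP over averaged neighbour states concatenated with node features) is an assumption for illustration only, not the formulation of [48].

import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    # One synchronous step H_{t+1} = f(H_t, X) over a dense adjacency matrix.
    def __init__(self, feat_dim, state_dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(state_dim + feat_dim, state_dim), nn.ReLU())

    def forward(self, H, X, A):
        # H: (n, state_dim) node states, X: (n, feat_dim) node features,
        # A: (n, n) adjacency matrix with 1 where an edge exists.
        degree = A.sum(dim=1, keepdim=True).clamp(min=1)
        neighbour_states = (A @ H) / degree  # average state of each node's neighbours
        return self.f(torch.cat([neighbour_states, X], dim=-1))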
To compute the output o_v of the GNN, a fully-connected neural network g acts as the output function applied to h_v:

o_v = g(h_v)    (2.3)

The final step is to define the loss function and optimize the parameters via gradient descent:

loss = \sum_{i=1}^{N} (t_i - o_i)    (2.4)

where N is the number of nodes and t_i is the target information on node i.
With the initial success of GNNs on different applications, a variety of new graph-based models were proposed. In the following sections we briefly review three of these variants.

2.3.2 Graph Convolution Network

The Graph Convolution Network (GCN) [24] initially proposed a first-order propagation model consisting of only one layer of convolution. This simplified version has a lower possibility of over-fitting on local neighborhood structures. As proposed by [24], the re-normalization trick is helpful to prevent vanishing gradients in the model. In this model, the graph convolution generates the normalized sum of the node features of the neighbors:

h_i^{(l+1)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)} \Big)    (2.5)

where \mathcal{N}(i) is the set of neighbors of node i (its incoming edges), c_{ij} is a normalization constant, and \sigma is the activation function (usually ReLU in GCNs).

2.3.3 Graph Attention Network

A GCN architecture is capable of node classification based on a local node-to-node message-passing process. Still, generalization may not happen in some cases because of the structure-dependent characteristic of the model. To address this problem, [17] suggests averaging over a node's neighbor features. Later, [43] suggests weighting the neighbor features and changing the way the aggregating function works. The node embedding of layer t+1 is computed from the previous layer as follows:

z_i^{(t)} = W^{(t)} h_i^{(t)}
e_{ij}^{(t)} = \mathrm{LeakyReLU}\big( a^{(t)\top} ( z_i^{(t)} \| z_j^{(t)} ) \big)
\alpha_{ij}^{(t)} = \frac{\exp(e_{ij}^{(t)})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik}^{(t)})}
h_i^{(t+1)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(t)} z_j^{(t)} \Big)    (2.6)

In this series of equations, a form of attention called additive attention is used. Note that \| denotes concatenation.

2.3.4 Gated Graph Neural Network

Following the work of [8] on Gated Recurrent Units (GRUs), [26] introduced the Gated Graph Neural Network (GG-NN), which is formulated as:

z_v^{(t)} = \sigma\big( W_z \cdot [h_v^{(t-1)}, a_v^{(t)}] \big)
r_v^{(t)} = \sigma\big( W_r \cdot [h_v^{(t-1)}, a_v^{(t)}] \big)
\tilde{h}_v^{(t)} = \tanh\big( W \cdot [r_v^{(t)} * h_v^{(t-1)}, a_v^{(t)}] \big)
h_v^{(t)} = (1 - z_v^{(t)}) \cdot h_v^{(t-1)} + z_v^{(t)} \cdot \tilde{h}_v^{(t)}    (2.7)

where a_v^{(t)} is the aggregated message received from the neighborhood nodes at the t-th iteration. The GG-NN update functions, based on GRUs, help the model generate information about the current node conditioned on the information of the other nodes and the previous steps. Using GRUs enables the model to have long-term propagation of information across the graph structure.
As mentioned before, our model is inspired by these approaches, and it is designed to work in the domain of food data analysis. In the following section we describe the work done in this domain.

2.4 Food Data Computation

Recent food datasets, such as Food-101 [2] and Recipe1M [36], provide information regarding the ingredients in a cooking recipe, the cooking instructions, and the corresponding images. These datasets catalyzed new research ideas. The work done on food datasets is mostly based on neural networks, as they have dominated the fields of Computer Vision and Natural Language Processing (NLP) in recent years.
A recent way to predict ingredients, cutting methods, and cooking methods is to use CNNs. For instance, [4] uses this information to do recipe retrieval. In this case, the additional information from cutting and cooking methods helps the performance of ingredient prediction.
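Looking back at the gated update of Eq. (2.7), the following bare-bones sketch writes the GRU-style node update out explicitly; the linear layers over concatenated inputs and the absence of per-edge weights are simplifications for illustration, not the exact GG-NN or GGAT implementation.

import torch
import torch.nn as nn

class GatedNodeUpdate(nn.Module):
    # GRU-style state update of Eq. (2.7), applied to every node at once.
    def __init__(self, dim):
        super().__init__()
        self.W_z = nn.Linear(2 * dim, dim)  # update gate
        self.W_r = nn.Linear(2 * dim, dim)  # reset gate
        self.W_h = nn.Linear(2 * dim, dim)  # candidate state

    def forward(self, h, a):
        # h: (n, dim) previous node states, a: (n, dim) aggregated neighbour messages.
        z = torch.sigmoid(self.W_z(torch.cat([h, a], dim=-1)))
        r = torch.sigmoid(self.W_r(torch.cat([h, a], dim=-1)))
        h_tilde = torch.tanh(self.W_h(torch.cat([r * h, a], dim=-1)))
        return (1 - z) * h + z * h_tilde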
Another approach is to learn a joint representation of images and ingredients using Deep Belief Networks (DBNs) [31]. A key feature of this work is the separation of visible and non-visible ingredients in modeling the problem, which results in an improvement in performance. Salvador et al. [36] bring up the idea of a shared space representation for image and text. They design a joint network architecture that finds the closest text data (cooking instructions and ingredients) to the cooking image by ranking the similarity between the generated embeddings. This architecture can retrieve cooking recipes from pictures, and the opposite is also possible.
Finally, [37] proposed the idea of generating recipes instead of retrieving them. They first generate ingredients using an image encoder pre-trained on ImageNet and a Transformer decoder. Then, they produce the final cooking instructions conditioned on the cooking images and the predicted ingredients using another Transformer decoder. Their model can generate recipes which are not included in the dataset, and using an attention mechanism in the form of Transformer models is one of the critical changes in their architecture compared to their previous work [36]. They also showed that using this new recipe model for a retrieval task can improve the accuracy of their earlier retrieval results.

Chapter 3
Approach

3.1 Model

Image-to-set prediction is a challenging task in the food computing context, where ingredients come in different colors and textures and are generally occluded or mixed in a cooked dish. In these problems, the dataset includes image and label-set pairs, and the objective is to learn a function that accurately predicts the set of labels given the image. In our domain, the images are only of food dishes, and the tags are ingredients. The set of labels is a variable-sized collection of unique items with no ordering.
Our model, represented in Figure 3.1, is a multimodal encoder-decoder framework which is able to capture relationships between label items and use this information to predict a set of labels conditioned on the input image. Our dataset consists of N image and label pairs \{(I^{(i)}, L^{(i)})\}_{i=1}^{N}, where L is a set of labels chosen from the label dictionary D = \{d_i\}_{i=1}^{M} of size M, mapped to the image I. Note that the size of L can be any number between 0 and K, where K is the maximum number of labels. L can also be encoded as a matrix of size K \times M, and the goal is to predict \hat{L} by maximizing the following likelihood:

\arg\max_{\theta_I, \theta_T, \theta_G} \sum_{i=1}^{N} \log p\big(\hat{L}^{(i)} = L^{(i)} \mid x^{(i)}; \theta_I, \theta_T, \theta_G\big)    (3.1)

where \theta_I, \theta_T, and \theta_G are the learning parameters of the image encoder, the Transformer, and the graph network, respectively. The framework consists of a CNN-based image encoder that is responsible for extracting visual features from the input image as an encoding vector. This vector aims to encapsulate the information of all input elements to help the decoder make accurate predictions. The decoder module consists of a Transformer and a GNN. Initially, the Transformer decodes the encoder output by taking an auto-regressive approach. The Transformer is composed of Transformer blocks, where each block is conditioned on the image encoding and the previously predicted ingredients. Later, each node of the GNN is initialized with the feature vectors generated by the Transformer blocks, and we use graph networks to capture relationships between ingredients more explicitly.
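As a rough illustration of the graph stage, the sketch below shows a single-head, GAT-style attention aggregation over ingredient nodes in the spirit of Eq. (2.6); the dense pairwise scoring and single head are simplifications, and this is not the exact GGAT module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    # Single-head additive attention over node pairs, as in Eq. (2.6).
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h, adj):
        # h: (n, in_dim) node states, one node per candidate ingredient.
        # adj: (n, n) 0/1 adjacency; a fully connected graph lets every
        # ingredient attend to every other ingredient.
        z = self.W(h)                                   # (n, out_dim)
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.a(pairs).squeeze(-1))   # (n, n) edge scores
        scores = scores.masked_fill(adj == 0, float('-inf'))
        alpha = torch.softmax(scores, dim=-1)              # attention over neighbours
        return torch.relu(alpha @ z)                       # updated node states

A gated variant in the spirit of Section 2.3.4 would combine this attention-weighted aggregation with a GRU-style node update.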
Note that we use the attention mechanism in the GNN to focus on relevant pairs of ingredient elements.

Figure 3.1: The model architecture.

The final state of each node of the graph is then fed to a fully-connected prediction network followed by a Softmax function to make a binary decision on whether an ingredient should be grounded to the input image. The Softmax accepts the output of the fully-connected network and generates a normalized probability distribution. We pick the highest probabilities at each step to make predictions.
It is important to note that keeping the ingredient outputs generated by the graph in an unordered representation requires pooling over the graph nodes. During training we use max-pooling to transform the Softmax output matrix of size K \times M into a tensor of length M. As mentioned before, M represents the number of ingredient classes. This output is later used for calculating the Binary Cross-Entropy (BCE) loss, which is our learning objective. Binary Cross-Entropy measures how far the prediction is from the true (binary) value for each of the classes and then averages these class-wise errors to obtain the final loss.
We leverage the formulation and components from [37] for image encoding (Section 3.1.1) and Transformer decoding (Section 3.1.2.1). Our main technical contribution is adding the GNN in the ingredient decoding described in Section 3.1.2.2. The model overview is depicted in Figure 1.1.
In the following sections, we dive into the modules used in our model design.

3.1.1 Image Encoder

We use ResNet-50 [19], initialized with weights pre-trained on ImageNet [35], for extracting image features. This model is composed of convolutional layers, and the number 50 refers to the number of layers. An advantage of this model is its use of skip-connections, which help prevent vanishing gradients, a problem that tends to affect deeper networks. The encoder transforms an input image I \in \mathbb{R}^{a \times b \times 3} into an embedding vector of dimension 512, where a and b are the width and height of the image. To do so, we remove the top fully-connected layer from ResNet and extract the a' \times b' \times 2048 tensor from the previous layers (a' and b' are the spatial dimensions of the feature map). Passing this tensor to a fully-connected layer with an output size of 512 transforms the extracted features to our desired output size, ready to be fed to the Transformer module.

Figure 3.2: A Transformer block: each block receives the output of the previous block and the image encoding, passes the inputs through its layers, and generates the next output.

3.1.2 Ingredient Decoder

As discussed before, the decoder is composed of two major components: a Transformer and a GNN. In the following sections each of these modules is described.

3.1.2.1 Transformer

Inspired by [1], the Transformer module acts as an auto-regressive architecture for the image-to-set prediction task. The Transformer conditions on an image to output a product of conditional probabilities, where each probability depends on the previously generated outputs and the input image. This architecture accounts for dependencies between labels by sequential prediction of ingredients:

p(\hat{L}^{(i)}_k \mid x^{(i)}, L^{(i)