UBC Theses and Dissertations

Source code representation for comment generation and program comprehension
Sharma, Rishab

Abstract

Code comment generation is the task of producing a high-level natural language description for a given code snippet. Comments help software developers maintain programs; however, comments are often missing or outdated. Many studies develop models, mainly deep neural networks, to generate comments automatically. A gap in current research is capturing the information carried by each character and the syntactic differences among tokens; moreover, the contextual meaning of code tokens is generally overlooked. In this thesis, we present the LAnguage Model and Named Entity Recognition Code comment generator (LAMNER-Code). A character-level language model is used to learn semantic representations, and a Named Entity Recognition model is trained to learn code entities. These representations are used in a Neural Machine Translation architecture to produce comments. We evaluate the comments generated by our model and by other baselines against the ground truth on a Java dataset with four standard metrics, BLEU, ROUGE-L, METEOR, and CIDEr, which improve by 3.26, 5.27, 1.25, and 0.1 points, respectively. The existing techniques and our proposed work are complementary to each other. Experiments on abstracted code further demonstrate the value of the LAMNER-Code embeddings, and a human evaluation confirms the quality of LAMNER-Code comments compared to the baselines and the reference comments. In addition, the new decoder sampling strategy presented in this work can better recall identifiers during comment generation. Despite the improvement in performance, we see that Transformer-based models perform comparably to this work. We therefore conduct an additional exploratory study to understand how Transformer-based models comprehend source code. The findings reveal both similarities with natural languages and differences in the attention of Transformer-based language models on source code. Finally, we use these findings to develop a new identifier-based embedding for classification tasks, which further improves code clone detection over the vanilla technique that uses the 'CLS' token.
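
As a rough illustration of the fused-representation idea described in the abstract (not the thesis's actual implementation), the PyTorch sketch below combines a per-token semantic embedding, standing in for one learned by a character-level language model, with an entity-type embedding standing in for a NER tag, and feeds the fused vectors to a recurrent encoder such as one used in a Neural Machine Translation pipeline. All names, dimensions, and the choice of concatenation as the fusion step are illustrative assumptions.

    # Minimal sketch: fusing semantic and entity embeddings before a
    # seq2seq encoder. Dimensions, names, and concatenation-as-fusion
    # are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class FusedEncoder(nn.Module):
        def __init__(self, vocab_size, num_entity_types,
                     sem_dim=256, ent_dim=32, hidden=512):
            super().__init__()
            # Stand-in for embeddings from a pretrained character-level LM.
            self.semantic_emb = nn.Embedding(vocab_size, sem_dim)
            # Stand-in for entity-type embeddings from a NER model
            # (hypothetical tags such as IDENTIFIER, TYPE, KEYWORD).
            self.entity_emb = nn.Embedding(num_entity_types, ent_dim)
            # Encoder over the fused per-token representations; its outputs
            # would feed an attentional decoder that emits the comment.
            self.rnn = nn.GRU(sem_dim + ent_dim, hidden,
                              batch_first=True, bidirectional=True)

        def forward(self, token_ids, entity_ids):
            fused = torch.cat([self.semantic_emb(token_ids),
                               self.entity_emb(entity_ids)], dim=-1)
            return self.rnn(fused)

    # Example: a batch of one 5-token snippet with per-token entity tags.
    enc = FusedEncoder(vocab_size=10_000, num_entity_types=8)
    tokens = torch.randint(0, 10_000, (1, 5))
    entities = torch.randint(0, 8, (1, 5))
    outputs, state = enc(tokens, entities)
    print(outputs.shape)  # torch.Size([1, 5, 1024])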

Rights

Attribution-NonCommercial-ShareAlike 4.0 International