UBC Theses and Dissertations

Source code representation for comment generation and program comprehension
Sharma, Rishab

Abstract

Code comment generation is the task of producing a high-level natural language description for a given code snippet. Comments help software developers maintain programs; however, comments are often missing or outdated. Many studies develop models, mainly deep neural networks, to generate comments automatically. A gap in current research is capturing the information carried by each character and the syntactic differences among tokens; moreover, the contextual meaning of code tokens is generally overlooked. In this thesis, we present the LAnguage Model and Named Entity Recognition Code comment generator (LAMNER-Code). A character-level language model is used to learn semantic representations, and a Named Entity Recognition model is trained to learn code entities. These representations are used in a Neural Machine Translation architecture to produce comments. We evaluate the comments generated by our model and by other baselines against the ground truth on a Java dataset with four standard metrics, BLEU, ROUGE-L, METEOR, and CIDEr, which improve by 3.26, 5.27, 1.25, and 0.1 points, respectively. The existing techniques and our proposed work are complementary to each other. Experiments on abstracted code further demonstrate the value of the LAMNER-Code embeddings, and a human evaluation confirms the quality of LAMNER-Code comments compared to the baselines and the reference comments. In addition, the new decoder sampling strategy presented in this work can better recall identifiers during comment generation. Despite the improvement in performance, we see that Transformer-based models perform comparably to this work. We therefore conduct an additional exploratory study to understand how Transformer-based models comprehend source code. The findings reveal both similarities with natural languages and differences in the attention of Transformer-based language models on source code. Finally, we use these findings to develop a new identifier-based embedding for classification tasks, which further improves code clone detection over the vanilla technique that uses the 'CLS' token.
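
As a rough illustration of the fused-representation idea described in the abstract (not the thesis's actual implementation), the PyTorch sketch below combines a per-token semantic embedding, standing in for one learned by a character-level language model, with an entity-type embedding standing in for a NER tag, and feeds the fused vectors to a recurrent encoder such as one used in a Neural Machine Translation pipeline. All names, dimensions, and the choice of concatenation as the fusion step are illustrative assumptions.

    # Minimal sketch: fusing semantic and entity embeddings before a
    # seq2seq encoder. Dimensions, names, and concatenation-as-fusion
    # are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class FusedEncoder(nn.Module):
        def __init__(self, vocab_size, num_entity_types,
                     sem_dim=256, ent_dim=32, hidden=512):
            super().__init__()
            # Stand-in for embeddings from a pretrained character-level LM.
            self.semantic_emb = nn.Embedding(vocab_size, sem_dim)
            # Stand-in for entity-type embeddings from a NER model
            # (hypothetical tags such as IDENTIFIER, TYPE, KEYWORD).
            self.entity_emb = nn.Embedding(num_entity_types, ent_dim)
            # Encoder over the fused per-token representations; its outputs
            # would feed an attentional decoder that emits the comment.
            self.rnn = nn.GRU(sem_dim + ent_dim, hidden,
                              batch_first=True, bidirectional=True)

        def forward(self, token_ids, entity_ids):
            fused = torch.cat([self.semantic_emb(token_ids),
                               self.entity_emb(entity_ids)], dim=-1)
            return self.rnn(fused)

    # Example: a batch of one 5-token snippet with per-token entity tags.
    enc = FusedEncoder(vocab_size=10_000, num_entity_types=8)
    tokens = torch.randint(0, 10_000, (1, 5))
    entities = torch.randint(0, 8, (1, 5))
    outputs, state = enc(tokens, entities)
    print(outputs.shape)  # torch.Size([1, 5, 1024])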

Rights

Attribution-NonCommercial-ShareAlike 4.0 International