Open Collections
UBC Theses and Dissertations
Source code representation for comment generation and program comprehension Sharma, Rishab
Abstract
Code comment generation is the task of producing a high-level natural language description for a given code snippet. Comments help software developers maintain programs; however, comments are often missing or outdated. Many studies develop models to generate comments automatically, mainly using deep neural networks. A gap in current research is capturing each character's information and the syntactic differences between tokens; moreover, the contextual meaning of code tokens is generally overlooked. In this thesis, we present the LAnguage Model and Named Entity Recognition Code comment generator (LAMNER-Code). A character-level language model learns the semantic representation of code, and a Named Entity Recognition model is trained to learn the code entities. These representations are fed into a Neural Machine Translation architecture to produce comments. We evaluate the comments generated by our model and other baselines against ground truth on a Java dataset with four standard metrics, BLEU, ROUGE-L, METEOR, and CIDEr, improving them by 3.26, 5.27, 1.25, and 0.1 points, respectively. The existing techniques and our proposed work are complementary to each other. Experiments on abstracted code further demonstrate the value of the LAMNER-Code embeddings. A human evaluation confirms the quality of LAMNER-Code comments relative to the baselines and the reference comments. In addition, the new decoder sampling strategy presented in this work better recalls identifiers during comment generation. Despite the improvement in performance, we find that Transformer-based models perform comparably to this work. Therefore, we conduct an additional exploratory study of how Transformer-based models comprehend source code. The findings reveal some similarities with natural language and also show differences in the attention of the Transformer-based language model on source code.
Finally, we use the findings from this study to develop a new identifier-based embedding for classification tasks, which further improves performance on code clone detection over the vanilla technique that uses the 'CLS' token.
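The per-token input the abstract describes — a semantic vector from a character-level language model fused with an entity-type vector from a Named Entity Recognition model, then passed to a Neural Machine Translation encoder — can be sketched roughly as follows. All names, dimensions, and the entity-type inventory here are illustrative assumptions, not the thesis's actual implementation:

```python
# Hedged sketch of a LAMNER-Code-style token representation: concatenate a
# character-level semantic embedding with a code-entity embedding per token.
# The character-level LM is mocked with a fixed random character table.
import numpy as np

ENTITY_TYPES = ["function", "variable", "type", "operator", "other"]  # assumed set

def char_level_embedding(token: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a character-level LM: mean of per-character vectors."""
    rng = np.random.default_rng(42)  # fixed seed so the demo is deterministic
    table = rng.normal(size=(128, dim))
    chars = [ord(c) % 128 for c in token]
    return table[chars].mean(axis=0)

def entity_embedding(entity: str) -> np.ndarray:
    """One-hot vector for the token's predicted code-entity type."""
    vec = np.zeros(len(ENTITY_TYPES))
    vec[ENTITY_TYPES.index(entity)] = 1.0
    return vec

def token_representation(token: str, entity: str) -> np.ndarray:
    """Concatenated semantic + entity representation for one code token."""
    return np.concatenate([char_level_embedding(token), entity_embedding(entity)])

# Representations for the Java snippet `int getCount ( )`, ready for an encoder:
tokens = [("int", "type"), ("getCount", "function"), ("(", "operator"), (")", "operator")]
matrix = np.stack([token_representation(t, e) for t, e in tokens])
print(matrix.shape)  # (4, 13): 4 tokens, 8 semantic dims + 5 entity dims
```

A character-level model of this kind can embed identifiers it has never seen whole (such as `getCount`), which is one motivation the abstract gives for moving below the token level.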
Item Metadata
Title | Source code representation for comment generation and program comprehension
Creator |
Supervisor |
Publisher | University of British Columbia
Date Issued | 2021
Description | Code comment generation is the task of producing a high-level natural language description for a given code snippet. Comments help software developers maintain programs; however, comments are often missing or outdated. Many studies develop models to generate comments automatically, mainly using deep neural networks. A gap in current research is capturing each character's information and the syntactic differences between tokens; moreover, the contextual meaning of code tokens is generally overlooked. In this thesis, we present the LAnguage Model and Named Entity Recognition Code comment generator (LAMNER-Code). A character-level language model learns the semantic representation of code, and a Named Entity Recognition model is trained to learn the code entities. These representations are fed into a Neural Machine Translation architecture to produce comments. We evaluate the comments generated by our model and other baselines against ground truth on a Java dataset with four standard metrics, BLEU, ROUGE-L, METEOR, and CIDEr, improving them by 3.26, 5.27, 1.25, and 0.1 points, respectively. The existing techniques and our proposed work are complementary to each other. Experiments on abstracted code further demonstrate the value of the LAMNER-Code embeddings. A human evaluation confirms the quality of LAMNER-Code comments relative to the baselines and the reference comments. In addition, the new decoder sampling strategy presented in this work better recalls identifiers during comment generation. Despite the improvement in performance, we find that Transformer-based models perform comparably to this work. Therefore, we conduct an additional exploratory study of how Transformer-based models comprehend source code. The findings reveal some similarities with natural language and also show differences in the attention of the Transformer-based language model on source code. Finally, we use the findings from this study to develop a new identifier-based embedding for classification tasks, which further improves performance on code clone detection over the vanilla technique that uses the 'CLS' token.
Genre |
Type |
Language | eng
Date Available | 2021-12-14
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-ShareAlike 4.0 International
DOI | 10.14288/1.0406070
URI |
Degree |
Program |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2022-02
Campus |
Scholarly Level | Graduate
Rights URI |
Aggregated Source Repository | DSpace