UBC Theses and Dissertations

Code representation learning with Prüfer sequences
Jinpa, Tenzin

Abstract

An effective and efficient code representation is critical to the success of sequence-to-sequence deep neural network models for a variety of code-understanding tasks, such as code summarization and documentation, which improve productivity and reduce software development costs. Unlike natural language, which is unstructured and noisy, program code is intrinsically structured, and a learning model can leverage this property. A significant challenge is to find a sequence representation that captures the structural information in the program code and facilitates the training of the models. In this study, we propose to use the Prüfer sequence of the Abstract Syntax Tree (AST) of a computer program to design a sequential representation scheme that preserves the structural information of the AST. Our representation makes it possible to develop deep-learning models in which signals carried by lexical tokens in the training examples can be exploited automatically and selectively, based on their syntactic role and importance. Unlike other recently proposed approaches, our representation is concise and lossless with respect to the structural information of the AST. To test the efficacy of the Prüfer-sequence-based representation, we designed a code summarization task using a sequence-to-sequence learning model on real-world benchmark datasets. The results of our empirical studies show that the Prüfer-sequence-based representation is indeed highly effective and efficient, significantly outperforming all the recently proposed deep-learning models we used as baselines.
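The abstract's claim that the representation is concise and lossless rests on a classical property of Prüfer sequences: a labeled tree on n nodes is uniquely encoded by a sequence of n − 2 node labels, and the tree can be fully reconstructed from that sequence. The sketch below shows the standard Prüfer encoding for a generic labeled tree; it is an illustration of the underlying encoding only, with hypothetical names (`prufer_sequence`, an edge-list input), not the thesis's actual AST pipeline.

```python
from collections import defaultdict

def prufer_sequence(edges, n):
    """Compute the Prüfer sequence of a labeled tree with nodes 1..n.

    The classical algorithm: repeatedly delete the leaf with the
    smallest label and record its neighbour. The resulting sequence
    of n - 2 labels uniquely (losslessly) encodes the tree.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seq = []
    for _ in range(n - 2):
        # Find the leaf (degree-1 node) with the smallest label.
        leaf = min(node for node in adj if len(adj[node]) == 1)
        (neighbour,) = adj[leaf]       # a leaf has exactly one neighbour
        seq.append(neighbour)          # record the neighbour's label
        adj[neighbour].discard(leaf)   # delete the leaf from the tree
        del adj[leaf]
    return seq

# Example: a star on node 4 with a tail 4-5-6
# (edges chosen purely for illustration).
print(prufer_sequence([(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)], 6))
# → [4, 4, 4, 5]
```

Because the sequence has exactly n − 2 entries and the decoding is deterministic, the encoding is both shorter than most traversal-based serializations of a tree and invertible, which is the sense in which the abstract calls the representation concise and lossless.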

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International