UBC Theses and Dissertations

Code representation learning with Prüfer sequences
Jinpa, Tenzin

Abstract

An effective and efficient code representation is critical to the success of sequence-to-sequence deep neural network models for a variety of code-understanding tasks, such as code summarization and documentation, which improve productivity and reduce software development costs. Unlike natural language, which is unstructured and noisy, program code is intrinsically structured, and a learning model can leverage this property. A significant challenge is to find a sequence representation that captures the structural information in the program code and facilitates the training of the models. In this study, we propose to use the Prüfer sequence of the Abstract Syntax Tree (AST) of a computer program to design a sequential representation scheme that preserves the structural information of the AST. Our representation makes it possible to develop deep-learning models in which signals carried by lexical tokens in the training examples can be exploited automatically and selectively, based on their syntactic role and importance. Unlike other recently proposed approaches, our representation is concise and lossless with respect to the structural information of the AST. To test the efficacy of the Prüfer-sequence-based representation, we designed a code summarization task using a sequence-to-sequence learning model on real-world benchmark datasets. The results of our empirical studies show that the Prüfer-sequence-based representation is indeed highly effective and efficient, significantly outperforming all the recently proposed deep-learning models we used as baselines.
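The abstract's claim that the representation is concise and lossless rests on a classical property of Prüfer sequences: a labeled tree on n nodes is uniquely encoded by a sequence of n − 2 node labels, and the tree can be fully reconstructed from that sequence. The sketch below shows the standard Prüfer encoding for a generic labeled tree; it is an illustration of the underlying encoding only, with hypothetical names (`prufer_sequence`, an edge-list input), not the thesis's actual AST pipeline.

```python
from collections import defaultdict

def prufer_sequence(edges, n):
    """Compute the Prüfer sequence of a labeled tree with nodes 1..n.

    The classical algorithm: repeatedly delete the leaf with the
    smallest label and record its neighbour. The resulting sequence
    of n - 2 labels uniquely (losslessly) encodes the tree.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seq = []
    for _ in range(n - 2):
        # Find the leaf (degree-1 node) with the smallest label.
        leaf = min(node for node in adj if len(adj[node]) == 1)
        (neighbour,) = adj[leaf]       # a leaf has exactly one neighbour
        seq.append(neighbour)          # record the neighbour's label
        adj[neighbour].discard(leaf)   # delete the leaf from the tree
        del adj[leaf]
    return seq

# Example: a star on node 4 with a tail 4-5-6
# (edges chosen purely for illustration).
print(prufer_sequence([(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)], 6))
# → [4, 4, 4, 5]
```

Because the sequence has exactly n − 2 entries and the decoding is deterministic, the encoding is both shorter than most traversal-based serializations of a tree and invertible, which is the sense in which the abstract calls the representation concise and lossless.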

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International