Fold2Vec: A Source Code Representation for Neural Networks

Please send questions and issues to Francesco Bertolotti and Walter Cazzola.

A brief introduction to Fold2Vec

We introduce a novel source code representation for use with neural networks. The representation is designed to produce a continuous vector for each code statement. In particular, we show how the representation is built for Java source code.

We evaluate our representation on three tasks: code summarization, statement separation, and code search. We compare against the state-of-the-art non-autoregressive and end-to-end models for these tasks. We conclude that all three tasks benefit from the proposed representation, with improved performance in terms of F1-score, accuracy, and MRR, respectively. Moreover, we show how models trained on code summarization and models trained on statement separation can be combined to address methods with tangled responsibilities, which means that these models can be used to enhance code visualization and detect code misconduct.
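For readers unfamiliar with MRR (mean reciprocal rank), the metric reported for code search, the following is a minimal sketch of the standard definition. It is not taken from the fold2vec-search package: the function name `mean_reciprocal_rank` and the convention of passing `None` for a query whose correct snippet was not retrieved are assumptions made for illustration, and the package's own evaluation script may differ in detail.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    ranks[i] is the 1-based position of the correct code snippet in the
    result list for query i, or None if it was not retrieved at all
    (hypothetical convention; such a query contributes 0).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    total = 0.0
    for rank in ranks:
        if rank is None:
            continue  # correct snippet missing: reciprocal rank counts as 0
        if rank < 1:
            raise ValueError("ranks are 1-based and must be >= 1")
        total += 1.0 / rank
    return total / len(ranks)
```

An MRR of 1.0 means the correct snippet was ranked first for every query; lower values mean it tended to appear further down the result list.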

The Fold2Vec implementation is split into two packages:

fold2vec: contains the code, dataset, and models for the tasks of code summarization and statement separation.
fold2vec-search: contains the code, dataset, and models for the task of code search.

Code Summarization & Statement Separation

The models, the code, and the parsed dataset are archived in the fold2vec package. In the package, the file readme.md gives a detailed description of how to run the experiments. In summary, there are three main scripts:

preprocess.sh: builds the cleaned dataset.
train.sh: trains the neural network.
test.sh: tests the neural network.

The required Python packages are listed in requirements.txt. There is also a Dockerfile that lists the system-wide requirements.

All these scripts depend on configuration files in python/configurations/*/*.json. You can customize the configuration directly from the command line using configator.
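To inspect which configuration files a script would pick up, a small helper along the following lines can be useful. This is only a sketch under stated assumptions: it relies solely on the Python standard library, it does not use configator, and the function name `load_configurations` is hypothetical. The only detail taken from the text above is the two-level layout `<root>/*/*.json`.

```python
import glob
import json
import os
from typing import Dict


def load_configurations(root: str) -> Dict[str, dict]:
    """Load every JSON file matching <root>/*/*.json.

    Returns a mapping from each file's path (relative to root) to its
    parsed content. Hypothetical helper, not part of the fold2vec package.
    """
    pattern = os.path.join(root, "*", "*.json")
    configs: Dict[str, dict] = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            configs[os.path.relpath(path, root)] = json.load(handle)
    return configs
```

Calling `load_configurations("python/configurations")` from the package root would, under these assumptions, return one entry per configuration file, which makes it easy to see the available keys before overriding any of them from the command line.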

There is also a Colab notebook that can be used to easily test the models.

Code Search

For this task, we used the CodeSearchNet dataset. The models, the code, and the parsed dataset are archived in the fold2vec-search package. In this package, the file readme.md gives a detailed description of how to run the experiments. In summary, there are three main scripts:

preprocess.sh: builds the cleaned dataset.
train.sh: trains the neural network.
test.sh: tests the neural network.

The required Python packages are listed in requirements.txt. There is also a Dockerfile that lists the system-wide requirements.

All these scripts depend on configuration files in src/python/configurations/*/*.json. You can customize the configuration directly from the command line using configator.

There is also a Colab notebook, notebook-search, that can be used to easily test the models.
