ISBN: (print) 9783319089799; 9783319089782
With dozens of popular programming languages in use worldwide, the number of source code files available online for public use is massive. However, most blogs, forums, and online Q&A websites offer poor searchability for source code in a specific programming language. Code editors invariably rely on naive rules of thumb based on the file extension, if any, to drive syntax highlighting, indentation, and other readability aids. A more systematic way to identify the language in which a given source file was written would be of immense value. We believe that simple Bayesian models are adequate for this task, given the intrinsic syntactic structure of any programming language. In this paper, we present Bayesian learning models that identify, with high probability, the programming language in which a given piece of source code was written. We used 20,000 source code files across 10 programming languages to train and test three Bayesian classifiers: Naive Bayes, Bayesian Network, and Multinomial Naive Bayes. Lastly, we compare the three models in terms of classification accuracy on the test data.
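To make the Multinomial Naive Bayes idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes token counts as features and Laplace (add-one) smoothing, neither of which is specified in the abstract; the `tokenize` helper, the `MultinomialNB` class, and the six-snippet `TRAIN` corpus are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: a token is a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(source):
    """Split source text into identifier/keyword/number and punctuation tokens."""
    return TOKEN_RE.findall(source)


class MultinomialNB:
    """Toy multinomial Naive Bayes over token counts (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.token_counts = defaultdict(Counter)  # language -> token -> count
        self.doc_counts = Counter()             # language -> number of files
        self.vocab = set()

    def fit(self, samples):
        for source, language in samples:
            tokens = tokenize(source)
            self.token_counts[language].update(tokens)
            self.doc_counts[language] += 1
            self.vocab.update(tokens)
        return self

    def log_scores(self, source):
        # Tokens never seen in training carry no evidence, so drop them.
        tokens = [t for t in tokenize(source) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for language, counts in self.token_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # log P(language) + sum over tokens of log P(token | language)
            score = math.log(self.doc_counts[language] / total_docs)
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            scores[language] = score
        return scores

    def predict(self, source):
        scores = self.log_scores(source)
        return max(scores, key=scores.get)


# Hypothetical toy corpus; the paper uses 20,000 files across 10 languages.
TRAIN = [
    ("def add(a, b):\n    return a + b\n", "Python"),
    ("import sys\nfor x in sys.argv:\n    print(x)\n", "Python"),
    ('#include <stdio.h>\nint main(void) { printf("hi"); return 0; }\n', "C"),
    ("int add(int a, int b) { return a + b; }\n", "C"),
    ("public class Main { public static void main(String[] args) "
     "{ System.out.println(1); } }", "Java"),
    ("public class Point { private int x; public int getX() { return x; } }",
     "Java"),
]

model = MultinomialNB().fit(TRAIN)
print(model.predict("def square(n):\n    return n * n\n"))
```

Working in log space avoids numeric underflow when a file contains thousands of tokens, and the smoothing term keeps a token that is absent from one language's training files from driving that language's probability to zero.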