Institutional-Repository, University of Moratuwa.  

Analyzing source code identifiers for code reuse using NLP techniques and wordnet

dc.contributor.author Pirapuraj, P
dc.contributor.author Perera, I
dc.date.accessioned 2018-08-01T21:20:22Z
dc.date.available 2018-08-01T21:20:22Z
dc.date.issued 2017
dc.identifier.uri http://dl.lib.mrt.ac.lk/handle/123/13350
dc.description.abstract A massive amount of source code is freely and openly available. Reusing open source code in projects can reduce project duration and cost. Even though several Code Search Engines (CSE) are available, finding the most relevant code can be challenging. In this paper we propose a framework to overcome this challenge. The proposed solution starts with a software architecture (class diagram) in XML format, extracts information from the XML file, and then fetches relevant projects from GitHub, SourceForge, and Google Code using three types of crawlers. It then finds the most relevant projects among the large number of downloaded projects. This research considers only Java projects. Every Java file in each project is represented as an Abstract Syntax Tree (AST) to extract identifiers (class names, method names, and attribute names) and comments. Action words (verbs) are extracted from comments using Part-of-Speech (POS) tagging. The identifiers and the XML file information are then analyzed for matching. Matched identifiers are given marks, the marks are added together, and if the total mark is greater than 50%, the .java file is considered relevant code. Otherwise, WordNet is used to get synonyms of those identifiers, and the matching process is repeated using those synonyms. For identifiers made of connected words, a camel case splitter and the N-gram technique are used to separate the words. The Stanford Spellchecker is used to identify abbreviated words. The results indicate successful identification of relevant source code. en_US
dc.language.iso en en_US
dc.subject Software Architecture; Class Diagram; WordNet; N-gram technique; PoS Tagging; Sourcecode identification; Code reuse en_US
dc.title Analyzing source code identifiers for code reuse using NLP techniques and wordnet en_US
dc.identifier.faculty Engineering en_US
dc.identifier.department Department of Computer Science and Engineering en_US
dc.identifier.year 2017 en_US
dc.identifier.conference Moratuwa Engineering Research Conference - MERCon 2017 en_US
dc.identifier.place Moratuwa, Sri Lanka en_US
dc.identifier.email pirapu@cse.mrt.ac.lk en_US
dc.identifier.email indika@cse.mrt.ac.lk en_US
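
The matching step described in the abstract (split connected-word identifiers, compare them with the class diagram names, accept a file when the total mark exceeds 50%, otherwise retry with synonyms) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `split_identifier`, `match_score`, and `is_relevant` are hypothetical, the mark is assumed to be the fraction of design identifiers that are matched, and a plain dictionary stands in for the WordNet synonym lookup.

```python
import re


def split_identifier(name):
    """Split a camelCase, PascalCase, or snake_case identifier into lowercase words."""
    # Acronym runs (XML), capitalized or lowercase words (Parser, get), and digit runs.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def match_score(design_names, code_names, synonyms=None):
    """Return the fraction of design identifiers whose words all occur in the code.

    Assumption: the abstract does not give the marking scheme, so a simple
    matched/total ratio is used here. `synonyms` is a hypothetical stand-in
    for a WordNet lookup: it maps a word to a list of acceptable alternatives.
    """
    synonyms = synonyms or {}
    code_words = set()
    for name in code_names:
        code_words.update(split_identifier(name))
    if not design_names:
        return 0.0

    def word_found(word):
        # A word counts as found if it, or one of its synonyms, appears in the code.
        return word in code_words or any(s in code_words for s in synonyms.get(word, ()))

    matched = 0
    for name in design_names:
        words = split_identifier(name)
        if words and all(word_found(w) for w in words):
            matched += 1
    return matched / len(design_names)


def is_relevant(design_names, code_names, synonyms=None, threshold=0.5):
    """First try exact word matches; fall back to synonyms only if that fails."""
    if match_score(design_names, code_names) > threshold:
        return True
    return match_score(design_names, code_names, synonyms) > threshold
```

For example, `is_relevant(["removeCourse"], ["deleteCourse"])` fails on exact matching alone, but succeeds once `{"remove": ["delete"]}` is supplied as the synonym table. The N-gram splitting and spell-checking steps mentioned in the abstract are not covered by this sketch.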

