Without a doubt source code search is an important software engineering activity. Empirical studies report that up to 30% of development tasks are related to search [1]. In our experience, we discovered that 36% of search queries were structural patterns, keywords or identifiers. Moreover, the most frequent search target during software maintenance is a statement.
Currently, developers use a simple text-based search engine to locate source code of interest. Alternatively, they use “where-used lists” to navigate to the target piece of source code. However, current technologies are not capable of processing fine grained structural information. For example if I compose a search query from terms, e.g., “if” and “delmdtflag”, I will retrieve a list of classes, where these two terms just occur together, but are not necessarily related. To make the query more precise, one can formulate such a query as a pattern of abstract syntax tree (AST). In this case the results will contain only those classes where “delmdtflag” is used in a clause of an “if´” statement. Until recently, the large amount of AST data and the bad performance of tree matching operations were the major show stoppers of this scenario. In our experiment we found out, that an in-memory computing engine is capable of storing and processing of a large amount of tree-shaped structural information. We analyzed a large software system (72M statements). Source code has been parsed and the resulting ASTs were stored in an in-memory column-oriented database. Since XPath is a standard query language for tree-shaped data we decided to formulate AST patterns as XPath queries. An XPath query engine has been implemented on top of the database.
Our preliminary results show that processing of even a large amount of tree-shaped structural data can be significantly accelerated. Moreover, the values in columns of the AST table repeat multiple times because the number of identifiers, literals and AST vertex types is finite and each vertex value occurs many times. Therefore, the application of compression algorithms for columns seems to be reasonable. In total the source code repository requires 8.5 GB, which is reasonable for a database of this size and granularity. In comparison, the size of original source code in text files was 3.2 GB.
For more information visit our project page.
References
[1] Timothy Lethbridge, Janice Singer, Studies of the Work Practices of Software Engineers, in Advances in Software Engineering: Comprehension, Evaluation, and EvolutionAuthor: Oleksandr Panchenko