January 30, 2003

new code repository

This is the new site for the Source Code Data Miner project. This is where we can post links to our status reports, documents and other project material.

The site has features such as calendars, and we can use it to post updates on what we've done on the project.

Since this is the first post, we'll start with all the information we know. Be sure to check the links on the right bar for quick access to all the currently known data.

The project description is as follows:
Comments in code and the names of functions and variables may contain clues about their possible functionality. This application would parse a C file and associate keywords in comments with their appropriate function. It also breaks up function names and expands them (for example, sys_memset into system memory set). Basic algorithms will be provided by the customer. Implementation language must be either Java or C++.

Here is some info from Bob Waters:

WordSplitter basically does the following:
while not done:
    for a given character string:
        if there is an underscore character, split the string into two parts
            (mem_set => mem set)
        else look for a capital letter (not the first character) and split there
            (memSet => mem set)
    for each remaining string, try to split it as follows:
        look up the current substring (starting at the full length of the word) to see if it is a word. If so, use that as the expansion.
        if not, reduce the substring length by one and try again. If there is a match, split the word there and continue on (memset => mem set)

For each split string, try to expand it: if it is a domain term, expand it via the dictionary (e.g. mem = memory, sys = system), or look it up in the dictionary as a real word. (mem set => memory set)
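
To make the WordSplitter steps concrete, here is a rough sketch of how they might fit together. It is written in Python only for brevity (the project itself calls for Java or C++), and it is not Bob's actual algorithm. The names DOMAIN_TERMS, WORDS, split_delimiters, split_by_dictionary and expand are made up for illustration, and the two tiny dictionaries are placeholder data. It also splits on every underscore and capital letter in one pass rather than looping "while not done", which should give the same pieces.

```python
# Hypothetical sample data: the real dictionaries are not given in the post.
DOMAIN_TERMS = {"mem": "memory", "sys": "system"}
WORDS = {"set", "copy", "system", "memory"}


def split_on_capitals(chunk):
    """Split before every capital letter except one in the first position."""
    pieces, start = [], 0
    for i in range(1, len(chunk)):
        if chunk[i].isupper():
            pieces.append(chunk[start:i])
            start = i
    pieces.append(chunk[start:])
    return pieces


def split_delimiters(name):
    """Split on underscores first, then on capital letters; lowercase the result."""
    parts = []
    for chunk in name.split("_"):
        if chunk:  # skip empty chunks from leading or doubled underscores
            parts.extend(split_on_capitals(chunk))
    return [p.lower() for p in parts]


def split_by_dictionary(token, known):
    """Greedy longest-prefix match: try the full string, then shrink by one.

    An unmatched remainder is kept as a single piece.
    """
    pieces = []
    rest = token
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in known:
                pieces.append(rest[:end])
                rest = rest[end:]
                break
        else:
            # No prefix of the remainder is a known word or domain term.
            pieces.append(rest)
            break
    return pieces


def expand(name):
    """Split an identifier and expand any domain terms via the dictionary."""
    known = WORDS | set(DOMAIN_TERMS)
    tokens = []
    for part in split_delimiters(name):
        tokens.extend(split_by_dictionary(part, known))
    return " ".join(DOMAIN_TERMS.get(t, t) for t in tokens)


print(expand("sys_memset"))  # system memory set
print(expand("memSet"))      # memory set
```

The greedy longest-prefix choice follows the "start at full length, reduce by one" description above; it can pick a poor split when a long dictionary word happens to be a prefix of the identifier, so the quality of the dictionary matters a lot.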

Code Miner basically takes a directory and a list of terms we are looking for and does the following:
Parse header comments and associate any terms with the file name.
For each function:
    All comments immediately before a function definition/implementation are assumed to describe that function. Parse the comments and associate any terms with the function name.
    Take the function name, variables and parameters and split them. If any resolve to terms, associate those terms also.
    Parse any comments in the body and associate those with the function name also.
    Discard stop words and language keywords (if, else, switch...).
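
Here is a very simplified sketch of the Code Miner steps for a single file, again in Python just to keep it short. Everything in it is an assumption for illustration: the names mine, words and TOKEN, the sample word lists, and the layout rules (function headers start in column 0 with a return type, the closing brace of a function is in column 0, and the first comment in the file is the header comment). It handles the function name and parameters but not local variables, and it is not a real C parser.

```python
import re
from collections import defaultdict

# Hypothetical sample lists; the post does not give the real ones.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is"}
C_KEYWORDS = {"if", "else", "switch", "for", "while", "return", "int", "char", "void"}
ABBREVIATIONS = {"mem": "memory", "sys": "system", "buf": "buffer", "len": "length"}

# One pass over the file: comments, function headers that start in column 0
# (a layout assumption), and the column-0 "}" that closes a function body.
TOKEN = re.compile(
    r"(?P<comment>/\*.*?\*/|//[^\n]*)"
    r"|(?P<func>^[A-Za-z_][\w \t*]*?\b(?P<name>[A-Za-z_]\w*)[ \t]*"
    r"\((?P<params>[^;{}()]*)\)\s*\{)"
    r"|(?P<end>^\})",
    re.DOTALL | re.MULTILINE,
)


def words(text):
    """Lowercase words from comments or identifiers, minus stop words and keywords."""
    # Put a space at each lower-to-upper case change so memSet becomes "mem Set".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Runs of letters only, so underscores and punctuation act as separators.
    found = (ABBREVIATIONS.get(w, w) for w in re.findall(r"[a-z]+", spaced.lower()))
    return {w for w in found if w not in STOP_WORDS and w not in C_KEYWORDS}


def mine(filename, source, terms):
    """Map the file name and each function name to the search terms found for it."""
    wanted = {t.lower() for t in terms}
    found = defaultdict(set)
    pending = set()       # words from comments waiting for the next function
    current = None        # function whose body we are inside, if any
    seen_header = False   # becomes True after the first comment or function
    for m in TOKEN.finditer(source):
        if m.group("comment"):
            if current is not None:
                found[current] |= words(m.group("comment")) & wanted
            elif not seen_header:
                found[filename] |= words(m.group("comment")) & wanted
                seen_header = True
            else:
                pending |= words(m.group("comment"))
        elif m.group("func"):
            seen_header = True
            current = m.group("name")
            names = words(current + " " + m.group("params"))
            found[current] |= (pending | names) & wanted
            pending = set()
        elif m.group("end"):
            current = None
    return dict(found)


SAMPLE = """\
/* String and memory helpers. */

/* Fill a buffer with a value. */
void sys_memset(char *buf, int len)
{
    // walk the whole buffer
    if (len > 0) { buf[0] = 0; }
}
"""

result = mine("util.c", SAMPLE, ["memory", "buffer", "system", "string"])
print(sorted(result["util.c"]))      # ['memory', 'string']
print(sorted(result["sys_memset"]))  # ['buffer', 'system']
```

Note that this sketch only splits names on underscores and case changes, so "memset" in sys_memset is not broken into "mem set" here; a fuller version would run names through the WordSplitter first.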

Posted by seebq at January 30, 2003 03:11 AM