April 29, 2003
Delivery Documentation
The Delivery Documentation is available on-line at:
April 25, 2003
April 24, 2003
Done & Done II
Apparently the link did not work, so here it is again. It tests great with fork.c, which is the file Bob first gave us!!!
Sincerely,
Daniel Fernandez
Done & Done
Well kids, this is it: this code removes the stopwords, then expands and substitutes function names. It's here to stay, and it's fully integrated, so just run it!!
Sincerely,
Daniel Fernandez
April 23, 2003
The stuff so far!!!
Download file
I think this works, but Java gives me an out-of-memory error!!!
April 22, 2003
Final Presentation Template
Hey kids -- here is the midterm presentation we can use as a template for the final presentation.
Enjoy....
April 18, 2003
My part is done!!
Hi folks, I just finished my part. The file is called filter.java, and it assumes function names and comments are placed inside vectors, as in the program that Matt wrote.
It pulls out the stopwords first, then it expands those function names that need to be expanded, and finally it replaces the words found in the domain dictionary file. The files stopwords.txt and domaindic.txt are attached. See you guys tomorrow!!
April 08, 2003
Latest source code
Also includes "flow.txt"; be sure to look at it for specific breakdowns of who's doing what.
April 05, 2003
April 04, 2003
Meeting Minutes
OK, so first we discussed the final things needed to make our SCDM bulletproof, then we spent some brainstorming time on each segment.
First issue is the stop word removal:
We are placing the stop words in a Collection, then using the removeAll method of Java's Vector to remove them all from the function comments vector:
func.comments.removeAll(stopWords);
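As a rough illustration of the removeAll idea above, here is a minimal sketch. The class name StopWordDemo and the sample words are made up for the example, and note that the match is case sensitive, so the real code may want to lower-case the comment words first.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Vector;

// Hypothetical demo class; not part of the SCDM source.
class StopWordDemo {
    // Removes every occurrence of each stop word from the comment words, in place.
    static Vector<String> strip(Vector<String> comments, Collection<String> stopWords) {
        comments.removeAll(stopWords); // exact, case-sensitive matches only
        return comments;
    }

    public static void main(String[] args) {
        Vector<String> comments = new Vector<>(
                Arrays.asList("copy", "the", "page", "tables", "of", "the", "parent"));
        Collection<String> stopWords = new HashSet<>(Arrays.asList("the", "of"));
        System.out.println(strip(comments, stopWords)); // prints [copy, page, tables, parent]
    }
}
```

Keeping the stop words in a HashSet makes each membership check cheap, which matters when the comment vectors get long.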
Next issue is the gcc preprocessing
Using a simple:
gcc -E -C file.c
The gcc -E option performs only preprocessing, and the -C option leaves all comments intact. This creates an extremely large file, full of extra comments that are not ours (from each included .h file, etc.). We need to rethink this idea and implement it as an option for executing preprocessor commands. We may simply need to skip many of the generated preprocessor segments; this may be easy to do, since they are labelled as they are inserted.
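One possible way to skip the generated segments is to use the linemarkers gcc writes into its -E output (lines of the form # linenum "filename" flags) and keep only the text that follows a marker naming our own file. This is only a sketch: the class and method names are invented here, and it assumes the output has already been read into a list of lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the SCDM source.
class PreprocessedFilter {
    // Matches a gcc linemarker such as: # 1 "/usr/include/stdio.h" 1 3 4
    private static final Pattern MARKER =
            Pattern.compile("^#\\s+(\\d+)\\s+\"([^\"]*)\".*$");

    // Keeps only the lines that gcc attributes to sourceName; drops the markers themselves.
    static List<String> keepOwnLines(List<String> lines, String sourceName) {
        List<String> kept = new ArrayList<>();
        boolean inOwnFile = false;
        for (String line : lines) {
            Matcher m = MARKER.matcher(line);
            if (m.matches()) {
                // A marker switches the "current file" for all following lines.
                inOwnFile = m.group(2).equals(sourceName);
                continue;
            }
            if (inOwnFile) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

The sourceName argument has to match the name exactly as gcc prints it, which is normally the path given on the command line.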
File directories
This needs to be bulletproof: it is not creating the new directories, and it is file-system dependent.
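A sketch of one way to handle both problems with java.io.File: build paths from parts instead of hard-coded separators, and call mkdirs() before opening the output file. The class name, method name, and parameters are assumptions for illustration only.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper; not part of the SCDM source.
class OutputDirs {
    // Returns the output file under outputRoot/dirParts..., creating missing directories.
    static File prepareOutput(File outputRoot, String[] dirParts, String fileName)
            throws IOException {
        File dir = outputRoot;
        for (String part : dirParts) {
            dir = new File(dir, part); // File inserts the platform's own separator
        }
        // mkdirs() also creates any missing parent directories.
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IOException("could not create directory " + dir);
        }
        return new File(dir, fileName);
    }
}
```

Because no "/" or "\" is ever typed into a path string, the same code should behave the same on Unix and Windows.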
Integration
We have started integrating many of the methods; however, there is excess code for directory structure manipulation. We will correct this next week.
March 29, 2003
notes - a to do list
I have made the following observations, as a to do list for all. These are things we need to work on.
-directory structures for unix/windows/dos
it needs to create directories if needed
more bulletproof - can we find example of this kind of parser?
-integration of preprocessing after file read
need to run the "gcc -C" preprocessing first, leaving comments in place
see: http://www.dis.com/gnu/gcc/gcc_14.html:
-C
Do not discard comments. All comments are passed through to the output file, except for comments in processed directives, which are deleted along with the directive.
You should be prepared for side effects when using `-C'; it causes the preprocessor to treat comments as tokens in their own right. For example, comments appearing at the start of what would be a directive line have the effect of turning that line into an ordinary source line, since the first token on the line is no longer a `#'.
-parser quirks
bulletproof read from bracket start to bracket end
data structure creation (vector of vectors?)
-data structure stop word extraction
methods for addition and deletion
-integration!
it will take one week. This is for sure!
Right now, we have minimal file parsing, very minimal function break up, good file I/O, and wordnet functionality.
If you have any feedback, please post comments.
March 27, 2003
WordNet is done!!!
Hi there folks, well, I am done with the word lookup in WordNet. The program called OpenUrl() takes in a string, which is the word to be looked up on WordNet, and returns true or false depending on whether it is a word or not. It is as simple as that!! Test it for yourselves; just erase the main when integrating it with the rest of the code!!!
March 26, 2003
The Wordnet stuff
Well, here is the code that opens the WordNet URL, enters a word, and retrieves the results into a file. I still need to do the file writing of the retrieved result and the parsing of it. I should be done tomorrow, so check for it then.
All you guys have to do is say OpenUrl("word to be retrieved"); and it does the rest. See ya.
March 17, 2003
Newest Source Code - March 17
This is everything. It loads files, it parses them, it adds the comments.
An explanation about parser:
1. First parseFile(File, File) is called. This opens the files... but the input file is actually completely read into the string fileString. fileIndex is the pointer to where the simulated reader is (you'll have to trace the code to get a feel for it... there are functions readChar, prevewChar, readString to help you).
2. Next dataMine() is called. This first reads in any pre-comments the file may have (comments at the top of the code before any other code). Then it loops through all the functions and reads in comments. Look at the FunctionData class to see how this is stored. The comments before and inside the function are added to the FunctionData (the name of the function is discovered in the middle of this, and after that it uses braces to find the end of the function).
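To give a feel for the "simulated reader" without tracing the whole parser, here is a stripped-down sketch. It borrows the names fileString, fileIndex, readChar and prevewChar from the description above, but the signatures, the atEnd and readBlock helpers, and the class itself are guesses made for illustration, not the actual parser code. It also counts braces naively, so a brace inside a string literal or comment would confuse it.

```java
// Hypothetical sketch; the real parser's methods may differ.
class SimulatedReader {
    private final String fileString; // the whole input file, held in memory
    private int fileIndex = 0;       // where the simulated reader currently is

    SimulatedReader(String contents) {
        this.fileString = contents;
    }

    boolean atEnd() {
        return fileIndex >= fileString.length();
    }

    // Returns the next character without consuming it ('\0' at end of input).
    char prevewChar() {
        return atEnd() ? '\0' : fileString.charAt(fileIndex);
    }

    // Returns the next character and advances past it ('\0' at end of input).
    char readChar() {
        return atEnd() ? '\0' : fileString.charAt(fileIndex++);
    }

    // Reads through the '}' that closes the first '{' met, and returns the text consumed.
    String readBlock() {
        int start = fileIndex;
        int depth = 0;
        while (!atEnd()) {
            char c = readChar();
            if (c == '{') {
                depth++;
            } else if (c == '}' && depth > 0 && --depth == 0) {
                break; // matching close brace found: end of the function body
            }
        }
        return fileString.substring(start, fileIndex);
    }
}
```

Holding the file in one string makes it easy to peek ahead or back up, which is handy when the function name is only discovered partway through.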
Please trace through the code to see what is happening. Included is fork.c, from the linux kernel, that I used to test it out.
March 12, 2003
Updated Source Code for Loading Files
This is the newest source code for loading files. It works very well now. Right now the code loads all of the files from a directory AND its subdirectories, prints out the first line of text in the file, and opens an output file in the output directory. The program only reads '.c' files. It also hangs up when trying to open the output file in a SUB-directory, but I'll get to that later. It works fine if everything is just in one directory.
The Parser class contains static methods. The driver class (which contains the options from the command arguments) is also static, which is accessed by SCDM.driver.
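For anyone curious how the directory walk described above can be done, here is a minimal recursive sketch that collects only '.c' files from a directory and its subdirectories. SourceFinder and findCFiles are made-up names, not the ones in the uploaded code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the SCDM source.
class SourceFinder {
    // Returns every file ending in ".c" under dir, including subdirectories.
    static List<File> findCFiles(File dir) {
        List<File> found = new ArrayList<>();
        File[] entries = dir.listFiles(); // null if dir is not a readable directory
        if (entries == null) {
            return found;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                found.addAll(findCFiles(entry)); // recurse into the subdirectory
            } else if (entry.getName().endsWith(".c")) {
                found.add(entry);
            }
        }
        return found;
    }
}
```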
February 24, 2003
First startup code
I have uploaded some beginning Java code that starts up, takes in options, and reads in files. It is a framework for anyone who needs to start working on their own parts.
February 14, 2003
Design Document Round 1
Phew. Here's the Design Documentation so far.
Any revisions needed? quick!
CS 3911 - Team 11 - Design Documentation
new files from bob
bob just sent me the stopwords file and the domain dictionary. i'm going to add them to the sample_code dir on cc as well.
February 13, 2003
Hey CBQ, here's the function expander as seen on the website, adjust the format to make sure that it is consistent with the one you've already created. Call me if you have any concerns!!
Here's the sequence diagram of the entire thing!!!!
design doc update
so cbq, dan, and I are sitting in the CoC working on the design document. I just emailed professor waters at home (hopefully) where he has a sample stop-word file and domain dictionary. when he sends those to me we might be able to compile and run his sample source, which i have conveniently put in a directory called sample_code in our cs3911a-dev dir on cc machines.
February 06, 2003
Presentation Work
Here is a link to the PDF version of our Microsoft Project generated Gantt chart: CS 3911 - Team 11 - Project Plan.pdf.
The presentation is underway!
January 30, 2003
new code repository
This is the new site for the Source Code Data Miner project. Here is where we can put links to our status reports, documents and so on and so forth.
It has lots of features like calendars and can allow us to update what we've done on the project.
Since this is the first post, we'll start with all the information we know. Be sure to check the links on the right bar for quick access to all the currently known data.
The project description is as follows:
Comments in code, and the names of functions and variables may contain clues about their possible functionality. This application would parse a C file and associate key words in comments with their appropriate function. It also breaks up function names and expands them (like sys_memset into system memory set). Basic algorithms will be provided by customer. Implementation language must be either Java or C++.
Here is some info from Bob Waters:
WordSplitter basically does the following:
while not done:
    for a given character string:
        if there is an underline character, split the string into two parts
            (mem_set => mem set)
        else look for a capital letter (not the first character) and split
            (memSet => mem set)
    for each remaining string, try to split it as follows:
        look up the current substring (starting at full length of word) to see if it is a word. If so, use that as the expansion
        if not, reduce the substring length by one and try again. If a match, split the word there and continue on (memset => mem set)
    for each split string, try to expand it: if it is a domain term, expand it via the dictionary (i.e. mem = memory, sys = system) or look it up in the dictionary as a real word. (mem set => memory set)
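The WordSplitter steps above could be sketched in Java roughly as follows. This is one reading of the description, not Bob's actual algorithm: the class and method names are invented, a plain Set stands in for the real-word lookup, and a Map stands in for the domain dictionary. It also assumes a domain term such as "mem" counts as a match during the shrinking lookup, since that is what the memset example seems to need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and data structures are assumptions.
class WordSplitterSketch {
    static List<String> split(String name, Set<String> words, Map<String, String> domain) {
        List<String> out = new ArrayList<>();
        // Split on underscores and before a capital letter that is not the first character.
        for (String part : name.split("_|(?<=[a-z0-9])(?=[A-Z])")) {
            if (part.isEmpty()) {
                continue;
            }
            for (String piece : splitByLookup(part.toLowerCase(), words, domain)) {
                // Expand domain terms (mem => memory); other pieces stay as they are.
                out.add(domain.getOrDefault(piece, piece));
            }
        }
        return out;
    }

    // Tries the full string first, then shrinks it one character at a time.
    static List<String> splitByLookup(String s, Set<String> words, Map<String, String> domain) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < s.length()) {
            int end = s.length();
            while (end > start && !isKnown(s.substring(start, end), words, domain)) {
                end--;
            }
            if (end == start) {
                end = s.length(); // nothing matched: keep the remainder whole
            }
            pieces.add(s.substring(start, end));
            start = end;
        }
        return pieces;
    }

    private static boolean isKnown(String s, Set<String> words, Map<String, String> domain) {
        return words.contains(s) || domain.containsKey(s);
    }
}
```

With words = {set} and domain = {mem => memory, sys => system}, split("sys_memset", ...) gives [system, memory, set], matching the sys_memset example in the project description.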
Code Miner basically takes a directory and a list of terms we are looking for and does the following:
Parse header comments and associate any terms with the file name.
For each function:
    All comments immediately before a function definition/implementation are assumed to describe that function.
    Parse the comments and associate any terms with the function name.
    Take the function name, variables and parameters and split them. If any resolve to terms, associate those terms also.
    Parse any comments in the body and associate those with the function name also.
    Discard stop words and language keywords (if, else, switch...).
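The association step in the Code Miner description might look something like the sketch below. It assumes the comment words and split-up names have already been collected into a list; TermIndex, associate and termsFor are hypothetical names, and language keywords are simply treated as extra stop words.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; not the customer's Code Miner.
class TermIndex {
    // function name => search terms found in its comments and identifiers
    private final Map<String, Set<String>> termsByFunction = new TreeMap<>();

    // Records every candidate word that is a search term and not a stop word.
    void associate(String functionName, List<String> candidateWords,
                   Set<String> searchTerms, Set<String> stopWords) {
        for (String raw : candidateWords) {
            String word = raw.toLowerCase();
            if (stopWords.contains(word)) {
                continue; // discard stop words and language keywords
            }
            if (searchTerms.contains(word)) {
                termsByFunction
                        .computeIfAbsent(functionName, k -> new TreeSet<>())
                        .add(word);
            }
        }
    }

    // Returns the terms recorded for a function, or an empty set if there are none.
    Set<String> termsFor(String functionName) {
        return termsByFunction.getOrDefault(functionName, Collections.<String>emptySet());
    }
}
```

The same method can be called once for the comments before a function, once for the comments in its body, and once for its split-up name, variables and parameters, since results accumulate under the same function name.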