April 29, 2003

Here is the user's manual

Download file

Posted by Daniel Fernandez at 01:21 PM | Comments (0)

Delivery Documentation

The Delivery Documentation is available on-line at:

Delivery Documentation.

Posted by seebq at 12:40 AM | Comments (0)

April 25, 2003

New Final Presentation

Download file

Posted by Evan Wheeler at 10:48 AM | Comments (0)

April 24, 2003

Done & Done II

Apparently the link did not work, so here it is again. It tests great with fork.c, which is the file Bob first gave us!!!

Sincerely,

Daniel Fernandez

Download file

Posted by Daniel Fernandez at 05:41 PM | Comments (0)

Done & Done

Well kids, this is it: this code removes the stopwords, then expands and substitutes for function names. It's here to stay, and it's fully integrated, so just run it!!

Sincerely,

Daniel Fernandez

Download file

Posted by Daniel Fernandez at 05:37 PM | Comments (0)

April 23, 2003

The stuff so far!!!

Download file
I think this works, but Java gives me an out-of-memory error!!!

Posted by Daniel Fernandez at 01:09 PM | Comments (0)

April 22, 2003

Final Presentation Template

Hey kids -- here is the midterm presentation, which we can use as a template for the final presentation.

Midterm Presentation.

Enjoy....

Posted by Charles Brian Quinn at 07:51 PM | Comments (0)

April 18, 2003

My part is done!!

Hi folks, I just finished my part. The file is called filter.java, and it assumes function names and comments are placed inside vectors, as the program that Matt wrote indicated.

It pulls out the stopwords first, then it expands those function names that need to be expanded, and finally it replaces the words that appear in the domain dictionary file. The files stopwords.txt and domaindic.txt are attached. See you guys tomorrow!!

Download file

Posted by Daniel Fernandez at 11:07 AM | Comments (0)

April 08, 2003

Latest source code

Also includes "flow.txt"; be sure to look at it for specific breakdowns of who's doing what.

Download file

Posted by Matt Quigley at 09:55 PM | Comments (0)

April 05, 2003

DO NOT ERASE

Download file

Posted by Daniel Fernandez at 09:39 PM | Comments (0)

dont erase

Download file

Posted by Daniel Fernandez at 09:38 PM | Comments (0)

Download file

Posted by Daniel Fernandez at 09:37 PM | Comments (0)

April 04, 2003

Meeting Minutes

OK, so first we discussed the final things needed to make our SCDM bulletproof, then we spent some brainstorming time on each segment.

First issue is the stop word removal:
We are placing the stop words in a Collection, then using Java's Vector.removeAll method to remove them all from the function comments vector:

func.comments.removeAll(stopWords);
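
A minimal sketch of that call in isolation, assuming the comments are stored one word per element in a Vector of Strings. The class name StopWordDemo and the sample words are made up for illustration and are not part of our source. Note that removeAll matches with equals(), so "The" and "the" count as different words unless everything is lower-cased first.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Vector;

// Hypothetical helper, for illustration only.
class StopWordDemo {
    // Removes every occurrence of every stop word from commentWords, in place.
    static Vector<String> stripStopWords(Vector<String> commentWords,
                                         Collection<String> stopWords) {
        commentWords.removeAll(stopWords);
        return commentWords;
    }

    public static void main(String[] args) {
        Vector<String> comments = new Vector<>(
                Arrays.asList("copy", "the", "page", "tables", "of", "the", "parent"));
        Collection<String> stopWords = new HashSet<>(Arrays.asList("the", "of", "a"));
        // Prints [copy, page, tables, parent]
        System.out.println(stripStopWords(comments, stopWords));
    }
}
```

Keeping the stop words in a HashSet makes each membership check cheap, which should matter once the comment vectors get large.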

Next issue is the gcc preprocessing
Using a simple:

gcc -E -C file.c

The gcc -E option performs only preprocessing. The -C option leaves all comments intact. Together they create an extremely large file. We now have lots of extra comments that are not ours (from each included .h file, etc.), so we need to rethink this idea and implement it as an option for executing preprocessor commands. We may simply need to skip many of the generated preprocessor segments; this may be easy to do, since they are labelled as they are inserted.
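
One way the skipping could work, sketched under assumptions: the labels in question are the preprocessor's linemarkers, lines of the form # 12 "fork.c" 2, and every line after a marker belongs to the named file until the next marker. The class LinemarkerFilter and its method are hypothetical, and file names containing escaped quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical filter, for illustration only.
class LinemarkerFilter {
    // Matches linemarkers such as:  # 12 "fork.c" 2
    private static final Pattern MARKER =
            Pattern.compile("^#\\s+\\d+\\s+\"([^\"]*)\".*$");

    // Keeps only the lines that the markers attribute to sourceName.
    // sourceName must match the name as it appears in the markers,
    // which is assumed to be the name passed to gcc on the command line.
    static List<String> keepOnly(List<String> preprocessed, String sourceName) {
        List<String> kept = new ArrayList<>();
        boolean inSource = false;
        for (String line : preprocessed) {
            Matcher m = MARKER.matcher(line);
            if (m.matches()) {
                inSource = m.group(1).equals(sourceName);
                continue; // the marker itself is never kept
            }
            if (inSource) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

Calling keepOnly(lines, "file.c") on the output of gcc -E -C file.c should then drop the header text and its comments while keeping our own.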

File directories
Needs to be bulletproof - it is not creating the new directories, and is file system dependent.
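
A possible fix, sketched with java.io.File so that no separator character is hard-coded. The class OutputDirs, the method name, and the idea of passing a path relative to the output root are assumptions for illustration, not what our code currently does.

```java
import java.io.File;

// Hypothetical helper, for illustration only.
class OutputDirs {
    // Builds outputRoot/relativePath and creates any missing parent
    // directories. relativePath may use '/' or '\' as its separator.
    static File prepareOutputFile(File outputRoot, String relativePath) {
        File target = outputRoot;
        for (String part : relativePath.split("[/\\\\]+")) {
            if (!part.isEmpty()) {
                target = new File(target, part); // File inserts the right separator
            }
        }
        File parent = target.getParentFile();
        if (parent != null && !parent.isDirectory() && !parent.mkdirs()) {
            throw new IllegalStateException("Could not create directory " + parent);
        }
        return target;
    }
}
```

The returned File can be handed straight to a FileWriter, since its directory is guaranteed to exist by then.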

Integration
We have started integrating many of the methods; however, there is excess code for directory structure manipulation. We will correct this next week.

Posted by Charles Brian Quinn at 06:06 PM | Comments (0)

March 29, 2003

notes - a to do list

I have made the following observations, as a to do list for all. These are things we need to work on.

-directory structures for unix/windows/dos
it needs to create directories if needed
more bulletproof - can we find example of this kind of parser?

-integration of preprocessing after file read
need to do "gcc -C" preprocessing first, leaving the comments in
see: http://www.dis.com/gnu/gcc/gcc_14.html:

-C
Do not discard comments. All comments are passed through to the output file, except for comments in processed directives, which are deleted along with the directive.
You should be prepared for side effects when using `-C'; it causes the preprocessor to treat comments as tokens in their own right. For example, comments appearing at the start of what would be a directive line have the effect of turning that line into an ordinary source line, since the first token on the line is no longer a `#'.

-parser quirks

bulletproof read from bracket start to bracket end
data structure creation (vector of vectors?)

-data structure stop word extraction
methods for addition and deletion

-integration!
it will take one week. This is for sure!

Right now, we have minimal file parsing, very minimal function break up, good file I/O, and wordnet functionality.

If you have any feedback, please post comments.

Posted by Charles Brian Quinn at 04:15 PM | Comments (0)

March 27, 2003

WordNet is done!!!

Hi there folks, well, I am done with the word lookup in WordNet. The program called OpenUrl() takes in a string, which is the word to be looked up on WordNet, and returns true or false depending on whether the thing is a word or not. It is as simple as that!! Test it for yourselves; just erase the main when integrating it with the rest of the code!!!

Download file

Posted by Daniel Fernandez at 05:04 PM | Comments (0)

March 26, 2003

The Wordnet stuff

Well, here is the code that opens the url from .net, enters a word, and retrieves the results into a file. I still need to do the file writing of the retrieved result and the parsing of it. I should be done tomorrow; check for it then.

All you guys have to do is say Openurl( "word to be retrieved"); and it does the rest. See ya.

Download file

Posted by Daniel Fernandez at 05:07 PM | Comments (0)

March 17, 2003

Newest Source Code - March 17

This is everything. It loads files, it parses them, it adds the comments.

An explanation about parser:
1. First parseFile(File, File) is called. This opens the files... but the input file is actually completely read into the string fileString. fileIndex is the pointer to where the simulated reader is (you'll have to trace the code to get a feel for it... there are functions readChar, prevewChar, readString to help you).
2. Next dataMine() is called. This first reads in any pre-comments the file may have (comments at the top of the code before any other code). Then it loops through all the functions and reads in comments. Look at the FunctionData class to see how this is stored. The comments before and inside the function are added to the FunctionData (the name of the function is discovered in the middle of this, and after that it uses braces to find the end of the function).
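
For anyone who wants the idea before tracing the real thing, here is a stripped-down sketch of a simulated reader plus the brace matching. The names fileString, fileIndex and readChar come from the explanation above; previewChar (spelled out in full here), readBlock, the class name and all the signatures are assumptions. Unlike a bulletproof version, this one would be fooled by braces inside string literals or comments.

```java
// Simplified illustration, not the actual parser source.
class SimulatedReader {
    private final String fileString; // the whole input file, already read in
    private int fileIndex = 0;       // position of the next unread character

    SimulatedReader(String fileString) {
        this.fileString = fileString;
    }

    // Returns the next character without consuming it, or -1 at end of input.
    int previewChar() {
        return fileIndex < fileString.length() ? fileString.charAt(fileIndex) : -1;
    }

    // Returns the next character and advances, or -1 at end of input.
    int readChar() {
        int c = previewChar();
        if (c != -1) {
            fileIndex++;
        }
        return c;
    }

    // Expects the reader to be positioned on '{'. Consumes everything up to
    // and including the matching '}' and returns the text between the two.
    String readBlock() {
        if (readChar() != '{') {
            throw new IllegalStateException("expected '{'");
        }
        int start = fileIndex;
        int depth = 1;
        while (depth > 0) {
            int c = readChar();
            if (c == -1) {
                throw new IllegalStateException("unbalanced braces");
            }
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
            }
        }
        return fileString.substring(start, fileIndex - 1);
    }
}
```

After readBlock() returns, the reader sits just past the end of the function, ready for the comments of the next one.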

Please trace through the code to see what is happening. Included is fork.c, from the linux kernel, that I used to test it out.

Download file

Posted by Matt Quigley at 07:36 PM | Comments (0)

March 12, 2003

Updated Source Code for Loading Files

This is the newest source code for loading files. It works very well now. Right now the code loads all of the files from a directory AND its subdirectories, prints out the first line of text in the file, and opens an output file in the output directory. The program only reads '.c' files. It also hangs up when trying to open the output file in a SUB-directory, but I'll get to that later. It works fine if everything is just in one directory.
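
The directory walk boils down to something like the following sketch. SourceFinder and findCFiles are illustrative names, not the ones in the uploaded code, and the uploaded code also prints the first line and opens the output files, which is left out here.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of the recursive load.
class SourceFinder {
    // Returns every '.c' file under dir, including those in subdirectories.
    static List<File> findCFiles(File dir) {
        List<File> found = new ArrayList<>();
        collect(dir, found);
        return found;
    }

    private static void collect(File dir, List<File> found) {
        File[] entries = dir.listFiles(); // null if dir is not a readable directory
        if (entries == null) {
            return;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collect(entry, found);
            } else if (entry.getName().endsWith(".c")) {
                found.add(entry);
            }
        }
    }
}
```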

The Parser class contains static methods. The driver class (which contains the options from the command arguments) is also static, and is accessed via SCDM.driver.

Download new file

Posted by Matt Quigley at 02:46 AM | Comments (0)

February 24, 2003

First startup code

I have uploaded some beginning Java code that starts up, takes in options, and reads in files. It is a framework for anyone who needs to start working on their own parts.

Download file

Posted by Matt Quigley at 03:13 AM | Comments (0)

February 14, 2003

Design Document Round 1

Phew. Here's the Design Documentation so far.

Any revisions needed? quick!

CS 3911 - Team 11 - Design Documentation

Posted by Charles Brian Quinn at 03:55 PM | Comments (0)

new files from bob

bob just sent me the stopwords file and the domain dictionary. i'm going to add them to the sample_code dir on cc as well.

Posted by Ali Reza Bahmanyar at 01:02 PM | Comments (0)

February 13, 2003

Hey CBQ, here's the function expander as seen on the website, adjust the format to make sure that it is consistent with the one you've already created. Call me if you have any concerns!!

Download file

Posted by Daniel Fernandez at 09:44 PM | Comments (0)

Here's the sequence diagram of the entire thing!!!!

View image

Posted by Daniel Fernandez at 09:10 PM | Comments (0)

urgent

help, i've fallen and i can't get up...

Posted by Matt Quigley at 08:38 PM | Comments (0)

design doc update

so cbq, dan, and I are sitting in the CoC working on the design document. I just emailed professor waters at home (hopefully) where he has a sample stop-word file and domain dictionary. when he sends those to me we might be able to compile and run his sample source, which i have conveniently put in a directory called sample_code in our cs3911a-dev dir on cc machines.

Posted by Ali Reza Bahmanyar at 08:35 PM | Comments (0)

February 06, 2003

Presentation Work

Here is a link to the PDF version of our Microsoft Project-generated Gantt chart. CS 3911 - Team 11 - Project Plan.pdf.

The presentation is underway!

Posted by Charles Brian Quinn at 08:40 PM | Comments (0)

January 30, 2003

new code repository

This is the new site for the Source Code Data Miner project. Here is where we can put links to our status reports, documents and so on and so forth.

It has lots of features like calendars and can allow us to update what we've done on the project.

Since this is the first post, we'll start with all the information we know. Be sure to check the links on the right bar for quick access to all the currently known data.

The project description is as follows:
Comments in code, and the names of functions and variables may contain clues about their possible functionality. This application would parse a C file and associate key words in comments with their appropriate function. It also breaks up function names and expands them (like sys_memset into system memory set). Basic algorithms will be provided by customer. Implementation language must be either Java or C++.

Here is some info from Bob Waters:

WordSplitter basically does the following:
while not done:
    for a given character string:
        if there is an underline character, split the string into two parts
        (mem_set => mem set)
        else look for a capital letter (not the first character) and split
        (memSet => mem set)
    for each remaining string, try to split it as follows:
        look up the current substring (starting at full length of word) to see if it is a word. If so, use that as the expansion
        if not, reduce the substring length by one and try again. If there is a match, split the word there and continue on (memset => mem set)

for each split string try to expand it: if it is a domain term, expand it via the dictionary (e.g. mem = memory, sys = system) or look it up in the dictionary as a real word. (mem set => memory set)
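
A rough Java reading of the steps above, offered as a sketch rather than Bob's algorithm verbatim. The class WordSplitterSketch is hypothetical, a plain Set of known words stands in for the real-word lookup, and a Map stands in for the domain dictionary file, whose format is not assumed here. Domain abbreviations are treated as known words during splitting so that "memset" can split at "mem".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the splitting and expansion steps.
class WordSplitterSketch {
    private final Set<String> words;          // stand-in for the real-word lookup
    private final Map<String, String> domain; // abbreviation -> expansion

    WordSplitterSketch(Set<String> words, Map<String, String> domain) {
        this.words = words;
        this.domain = domain;
    }

    // Splits an identifier and expands each part, e.g. sys_memset.
    List<String> expand(String identifier) {
        // Split on underscores, then before every capital that is not first.
        String spaced = identifier.replace('_', ' ')
                                  .replaceAll("(?<!^)(?=[A-Z])", " ");
        List<String> result = new ArrayList<>();
        for (String piece : spaced.trim().split("\\s+")) {
            if (piece.isEmpty()) {
                continue;
            }
            for (String part : splitByLookup(piece.toLowerCase())) {
                result.add(domain.getOrDefault(part, part));
            }
        }
        return result;
    }

    private boolean known(String s) {
        return words.contains(s) || domain.containsKey(s);
    }

    // Tries the longest prefix first, shrinking by one character until a
    // known word is found, then repeats on whatever is left.
    private List<String> splitByLookup(String word) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            while (end > start + 1 && !known(word.substring(start, end))) {
                end--;
            }
            if (!known(word.substring(start, end))) {
                parts.add(word.substring(start)); // nothing matched: keep the rest whole
                break;
            }
            parts.add(word.substring(start, end));
            start = end;
        }
        return parts;
    }
}
```

With "set" as a known word and mem and sys in the domain map, expand("sys_memset") gives [system, memory, set], which matches the example in the project description.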

Code Miner basically takes a directory and a list of terms we are looking for and does the following:
parse header comments and associate any terms with the file name.
For each function:
All comments immediately before a function definition/implementation are assumed to describe that function. Parse the comments and associate any terms with the function name. Take the function name, variables and parameters and split them. If any resolve to terms, associate those terms also. Parse any comments in the body and associate those with the function name also. Discard stop words and language keywords (if, else, switch...).
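
The association step for a single comment might look like the sketch below. TermAssociator and termsIn are hypothetical names, and splitting on anything that is not a letter or digit is an assumption about how comments get tokenized. Language keywords can simply be added to the stop word set.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of matching comment words against the term list.
class TermAssociator {
    // Returns the search terms that occur in commentText, ignoring case
    // and skipping anything listed in stopWords.
    static Set<String> termsIn(String commentText,
                               Set<String> searchTerms,
                               Set<String> stopWords) {
        Set<String> hits = new LinkedHashSet<>();
        for (String token : commentText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue;
            }
            if (searchTerms.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

The caller would then store the result in something like a Map from function name (or file name, for header comments) to its set of terms.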

Posted by seebq at 03:11 AM | Comments (0)