Looking for tool to find out similarity in coding


kyoko
February 25th, 2004, 05:51 PM
I am looking for a tool that can compare two text files (program source code) and tell me the percentage of similarity between them. My purpose is to detect code cheating in my class.

Any help will be appreciated!

Thank you very much.

Kheun
February 25th, 2004, 08:02 PM
How about using windiff.exe, which comes with the VC++6 installation? It compares two files and highlights the differences. However, I don't think it has any indicator that displays the percentage difference.

If not, you may like to look at this link (http://download.com.com/3120-20-0.html?qt=diff&tg=dl-2001). The results are generated by a "diff" keyword search on www.download.com.

kyoko
February 25th, 2004, 09:04 PM
Thanks very much for your info. But it seems that windiff.exe only compares files as text. It cannot detect similarity in the code structure. In other words, if two source files have the same structure but use different variable names, it would not be able to detect that they are similar. It would also be nicer to have a text-based Unix/Linux tool, so that I can write a script and compare a large number of files with a single command.

By the way, the program code I am targeting is going to be C programs.

Thanks again.
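
To make the "same structure, different variable names" idea concrete, here is a minimal sketch in Python of one way it could be approached. It is not an existing tool: the names `normalize` and `similarity` are made up for illustration, the tokenizer is a rough approximation of C (no preprocessor handling, no multi-character operators), and every identifier, including function names, is collapsed to the same placeholder. The percentage comes from the standard library's `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher

# A subset of C keywords; anything else that looks like a name is an identifier.
C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}

# Rough C tokenizer (an assumption, not a full lexer).
TOKEN_RE = re.compile(
    r"//[^\n]*"              # line comment
    r"|/\*.*?\*/"            # block comment
    r'|"(?:\\.|[^"\\])*"'    # string literal
    r"|'(?:\\.|[^'\\])*'"    # character literal
    r"|[A-Za-z_]\w*"         # identifier or keyword
    r"|\d[\w.]*"             # numeric literal (rough)
    r"|\S",                  # any other single non-space character
    re.DOTALL,
)


def normalize(source):
    """Turn C source into a token list with names and literals abstracted away."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok.startswith("//") or tok.startswith("/*"):
            continue                      # comments do not affect structure
        if tok[0] == '"':
            tokens.append("STR")
        elif tok[0] == "'":
            tokens.append("CHR")
        elif tok[0].isdigit():
            tokens.append("NUM")
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append(tok if tok in C_KEYWORDS else "ID")
        else:
            tokens.append(tok)            # punctuation and operators
    return tokens


def similarity(src_a, src_b):
    """Return a similarity percentage (0 to 100) between two C sources."""
    a, b = normalize(src_a), normalize(src_b)
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    one = "int sum(int n) { int s = 0; for (int i = 0; i < n; i++) s += i; return s; }"
    two = ("int total(int count) { int acc = 0; /* add up */ "
           "for (int k = 0; k < count; k++) acc += k; return acc; }")
    print(similarity(one, two))
```

Since it is a plain script, it could be run over every pair of submissions in a directory from a single command. A high score would only be a hint to look at the two files by hand, not proof of copying.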

SolarFlare
February 25th, 2004, 10:27 PM
There was an article about this in Scientific American a little while back. One of the major difficulties in scanning this type of thing is that when the documents are of different lengths, it is difficult to find which parts "correspond" and thus compare them. I don't know of any tool which does this for code (which would be simpler than straight text, I assume), but if you look into this resource you may be able to write your own.

SolarFlare
February 25th, 2004, 10:28 PM
The article is from June 2003 Scientific American, called "Chain Letters and Evolutionary Histories" by Charles H Bennett, Ming Li and Bin Ma. It does not focus on programming specifically but the concept is identical.

[edit: I suppose I should clarify. The authors in question used the type of analysis you are requesting on a series of related chain letters and then applied that information to create a type of taxonomy of order... I suppose you can see how it relates -- also good reading]
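
For anyone who wants to experiment with this, below is a minimal sketch in Python of a compression-based distance, which as far as I understand is the kind of measure the article builds on. This is not the authors' code: it uses the formula usually called the normalized compression distance, with `zlib` swapped in as a stand-in compressor, and the function names `compressed_size` and `ncd` are made up for illustration. The intuition is that if two files share a lot of content, compressing them together costs little more than compressing one alone.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)          # cost of compressing both together
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = (b"for (i = 0; i < n; i++) total += values[i];\n"
         b"while (p != NULL) { count++; p = p->next; }\n")
    b = b"The quick brown fox jumps over the lazy dog near the quiet river bank today."
    print(ncd(a, a))   # same content: small distance
    print(ncd(a, b))   # unrelated content: larger distance
```

On very short files the fixed overhead of the compressor distorts the numbers, so the values are better used for ranking pairs against each other than as absolute percentages.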