Click to See Complete Forum and Search --> : Text chunks differencing


nuksip
March 19th, 2009, 02:47 PM
Hi all,

We have the text files that we need to diff. using standard GNU tool wdiff that is based on O(ND) Difference algorithm. But the main pain that this documents were formed from extracted text chunks.
When text chunks are extracted they have totally different order and wdiff tool picking up too many changes. Let me give an example.

Base file:

Text was extracted. Many chunks was formed and merged together.

File with extracted chunks:

| Text | Many chunks | and merged . | were extracted. | was formed | together.

Chunks are splitted by |.

The task is to find the algorithm that can reorder the chunks to get as less edits as possible.

If I reorder extracted text chunks like this:

| Text | were extracted. | Many chunks | was formed | and merged . | together.

there will be only a few edits.

I tried to do brute force approach like:

1) enumerate all chunks combination's for ex: 1, 2, 3; 1, 3, 2; 3, 1, 2....
1.1) Form the string from chunks
1.2) Get longest common subsequence from two strings
1.3) Save best solution, found so far

This approach takes n! * O(LCS) time, where n is the number of chunks. Can someone suggest a faster way to do this?