Click to See Complete Forum and Search --> : Fast Deletion Algorithm for Large Scale Duplicated Web Pages
suchuhui80
October 9th, 2006, 01:57 AM
recently I am reading an Algorithm for deleting duplicated web-pages,
I am reading the article which introduces a good algorithm with precision and efficiency.But I don't understand how the algorithm find a way to match the feature code . Let me show an example :
The are two web-pages with the feature code of "A hero with long-bow shot the dragon" and "Hero with longbow show the dragon.".In this case two of this web-pages are treated as the same page,but if I get the first feature code indexed in B-TREE first , afterwards I get the second feature code search in that B-TREE finding out that it can't be found ,as the second page will treated as a different page .That's out of my expectation.
is there any one familiar with "Deleting duplicated document" ?
sorry,problem is still not resolved.
cilu
October 9th, 2006, 02:12 AM
[ redirected thread ]
RoboTact
October 10th, 2006, 01:04 PM
recently I am reading an Algorithm for deleting duplicated web-pages,
I am reading the article which introduces a good algorithm with precision and efficiency.But I don't understand how the algorithm find a way to match the feature code . Let me show an example :
The are two web-pages with the feature code of "A hero with long-bow shot the dragon" and "Hero with longbow show the dragon.".In this case two of this web-pages are treated as the same page,but if I get the first feature code indexed in B-TREE first , afterwards I get the second feature code search in that B-TREE finding out that it can't be found ,as the second page will treated as a different page .That's out of my expectation.
is there any one familiar with "Deleting duplicated document" ?
sorry,problem is still not resolved.Posting link to that article will probably help. What is the feature code? Is there a small number of feature code strings per file? What is the definition of such strings' equivalence?
MikeAThon
October 10th, 2006, 05:11 PM
... and, the two feature codes that you posted are not equal:
"A hero with long-bow shot the dragon"
"Hero with longbow show the dragon."
They are close, but they are not equal, such that (depending on the algorithm) it might not be a surprise that the algorithm determines that the pages are different.
Mike
RoboTact
October 10th, 2006, 05:24 PM
"A hero with long-bow shot the dragon"
"Hero with longbow show the dragon."[/code]
They are close, but they are not equal, such that (depending on the algorithm) it might not be a surprise that the algorithm determines that the pages are different.Sure he means equivalency of some sort. But it's not a problem if such equivalency is not overly complicated - more challenging problem is dealing with all phrases on all pages. Which in turn requires mere definite bounds on number of pages, number of phrases on each page and allowed amount of time co compute requested things.
suchuhui80
October 11th, 2006, 02:33 AM
thanks.
to determind two of the feature code is the same or not is not so hard .
let me show you .
A hero with long-bow shot the dragon
Hero with longbow shot the dragon.
if the "longest mutual element" (5 ,only 'A' and 'long-bow' are different)of the two feature-code is large than the length of the feature code -2 ( 4 ),then they are treated as the same feature-code.
If there are small scale of web-pages to compare ,problem will be resolved,
I can compare the feature-code using the algorithm above to the web-pages one by one ,but indeed the amount of web-pages to compare are very huge ,I need to put the feature-code into a b-tree , but If I do so how can I find the closest feature-code in b-tree ,that 's impossible . Maybe I should consider HASH instead , but is it possible for hash function to find the closest feature-code ? thanks ?
RoboTact
October 11th, 2006, 03:26 AM
Please answer specifically questions I asked:bounds on number of pages, number of phrases on each page and allowed amount of time to compute requested things.plus what exactly should be computed.
suchuhui80
October 11th, 2006, 11:23 PM
bounds on number of pages, number of phrases on each page and allowed amount of time to compute requested things.
Please answer specifically questions I asked:plus what exactly should be computed.
Thank you so much.
It's kind of embarrsing for me to ask such question here ,maybe you may ask me why you go buy a concerning book to seek the solution first,but I really did but gained no harvest .
1.bounds on number of pages:
my company is developing a search-engine .
one of the big issue is to delete a large scale of duplicated webpages (maybe to deal with 100 million pages) .
2.number phrases on each page.
I haven't counted it yet, but each page has approximately 4 KB in size.
3.amount of time to compute requested things
if you are not saying "times" .
-- 1.How long it take to insert the feature code depends on what data-algorithm I adopt (two data-algorithm I can adopt ,BT-TREE OR HASH-MAP) .Which one I adopt is not a problem, both of them are fast (HASH-MAP is more fast if the amount of webpage is fixed).
-- 2.Compareing two of the feature-code is not slow as well.The feature-code is not so long ,each of which has 20 bytes precisely .Compare the short string is not quite time-comsuing.
4.What exactly I need to compute .
-- 1.insert feature-code into B-TREE or HASH-MAP .
-- 2.compare two of feature-code of which two of web-pages is potentially
the same.
the latter maybe is optional ,but lacking that will cause the problem that my algoritm would not delete the duplicated web-page precisely. as my upper demonstration show,two of the feature-code is different while two of the associated web-page are the same .
the tough-nut is I need to insert the the feature-code (that "hero" feature-code ) into B-TREE or HASH-MAP in order ,and find a way to indicate
that two of the feature-code is the same using the second said method. implementing two of this separately is easy,mergeing two of them is hard ,
one needs accurate match while the other needs woozy match !
RoboTact
October 12th, 2006, 02:48 AM
So each page contains hundreds of these 'feature codes' (?). How do you decide pages are 'the same'? If they contain at least one pair of equivalent feature codes? Seems irrelevant.
Definition of equivalency:
if the "longest mutual element" (5 ,only 'A' and 'long-bow' are different)of the two feature-code is large than the length of the feature code -2 ( 4 ),then they are treated as the same feature-code.
So as bottom line what you really need is dictionary matching with up to 2 errors, number of words about 10^10, size of alphabet is size of natural language dictionary, that is about 10^4 if you cut uncommon words.
http://www.google.com/search?hl=en&lr=&q=dictionary+matching+errors&btnG=Search
codeguru.com
Copyright Internet.com Inc., All Rights Reserved.