single or multiple file for indexing
karx
October 19th, 2008, 09:51 PM
I'm developing a new search engine with a whole new index schema. Since I'm focusing on performance over features, I would like to know which approach is better.
Currently I'm using multiple files for fixed-value attributes (e.g. date, category, ...). These files are separate from the inverted dictionary file.
Example fields would be:
author label
location label
country label
price float
Each field has its own file, named like 'index_N.a', where 'N' is the id of the field (for example, the price field).
In terms of disk seeks, which is better: putting all the attributes in a single file with sections (using address offsets), or using multiple files?
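To make the single-file option concrete, here is a minimal sketch, assuming a hypothetical layout in which a small header stores one (offset, length) pair per attribute section. The names `pack_attributes` and `read_attribute` and the header format are illustrative assumptions, not part of any existing engine.

```python
import io
import struct

# Hypothetical header entry: (offset, length) per section, little-endian.
HEADER_ENTRY = struct.Struct("<QQ")

def pack_attributes(sections):
    """Concatenate attribute blobs into one buffer behind an offset table."""
    count = len(sections)
    header_size = 4 + count * HEADER_ENTRY.size
    out = io.BytesIO()
    out.write(struct.pack("<I", count))          # number of sections
    offset = header_size                         # first blob starts after the table
    for blob in sections:
        out.write(HEADER_ENTRY.pack(offset, len(blob)))
        offset += len(blob)
    for blob in sections:
        out.write(blob)
    return out.getvalue()

def read_attribute(f, field_id):
    """Read the section count, then the table entry, then jump to the data."""
    f.seek(0)
    (count,) = struct.unpack("<I", f.read(4))
    if not 0 <= field_id < count:
        raise IndexError(field_id)
    f.seek(4 + field_id * HEADER_ENTRY.size)
    offset, length = HEADER_ENTRY.unpack(f.read(HEADER_ENTRY.size))
    f.seek(offset)
    return f.read(length)
```

With this layout the file is opened once and the offset table can be cached in memory, so fetching any attribute section costs a single seek. Whether that beats separate files in practice depends on the filesystem and the access pattern, so it is worth measuring both.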
TheCPUWizard
October 19th, 2008, 09:55 PM
Do you think you can develop something that is better than the database systems which have undergone hundreds of thousands of hours of development and testing?
Unless it is purely a learning exercise, there is no reason to develop this type of code from scratch in any "real-world" scenario.
Even coming up with an answer to your one specific question would take significant research and use-case analysis (man-months to many years).
karx
October 19th, 2008, 10:03 PM
I'm aware of that, and my approach is tailored to my own project, which is about performance on low-spec hardware rather than bulky features. I'm using modular C code, with most features currently inside the core engine. One of the projects I'm looking up to is sphinxsearch, but I prefer to develop my own, which I can tune to perform the way I want.
*Hey, even Microsoft made a mistake, you know: Vista and Windows 7.
Sorry, I'm not here to debate, but rather to find a solution to my problem. Thanks for your reply.
*Just FYI, I took 6 months to develop the index schema, and now I'm in the core development stage. So that I don't go too far with a mistake, I would like some input on better ways to do things :)
TheCPUWizard
October 19th, 2008, 10:18 PM
1) I am not here to debate either, merely pointing out that it is extremely unlikely that you can develop something as robust or performant as the available offerings without spending a few YEARS doing it.
I have been a professional developer for over 30 years, and the LAST custom data storage system (including indexing) I wrote was well over a decade ago. This includes non-Windows systems and even small embedded systems.
2) Interesting that you can state "Windows 7" is a mistake when it is still under development. Both Vista and Windows Server 2008 have been blowing away the competition when it comes to almost every "honest" metric. Every one of my large (>$100M / yr) clients [that use the Windows platform] either has made the switch or has a plan in place to migrate off all of the older systems (most are also committed to 100% 64-bit, since there are almost no 32-bit-only machines in the production pipeline).
3) The analysis of even the simple 4-field example you originally posted will vary greatly depending on data patterns, hardware configuration, and many other issues.
Many modern systems will use a single large pre-allocated file with the assumption that the file location on disk is balanced for fastest access to portions of the file that are logically close.
The internal allocation of the file is broken down into pages, where a given page will contain either data (sometimes by row, other times by column) or index entries (for a specific index). The pages are internally and transparently swapped based upon usage analysis.
If the actual usage follows a set of repeatable patterns, then the internal structures will eventually align themselves so that access into the file appears as a bell curve centered around the midpoint of the file.
Depending on your OS (and possibly third party tools), this may work with or against defragmentation mechanisms which will reposition portions of the files on the drive to minimize total elevator seek times.
Those are just a very few of the considerations to review before beginning to develop such a system.
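The paged layout described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the 4096-byte page size, the one-byte type tag, and the function names are all hypothetical, and the usage-based page swapping is left out entirely.

```python
import io

PAGE_SIZE = 4096               # assumed page size; real systems vary
PAGE_DATA, PAGE_INDEX = 1, 2   # hypothetical page-type tags (0 = unused page)

def preallocate(f, page_count):
    """Reserve the whole file up front so pages keep fixed positions."""
    f.seek(0)
    f.write(b"\x00" * (PAGE_SIZE * page_count))

def write_page(f, page_no, page_type, payload):
    """Store a type tag plus payload, zero-padded to one full page."""
    if len(payload) > PAGE_SIZE - 1:
        raise ValueError("payload does not fit in one page")
    f.seek(page_no * PAGE_SIZE)
    f.write(bytes([page_type]) + payload.ljust(PAGE_SIZE - 1, b"\x00"))

def read_page(f, page_no):
    """Return (page_type, raw payload) for one page using a single seek."""
    f.seek(page_no * PAGE_SIZE)
    page = f.read(PAGE_SIZE)
    return page[0], page[1:]
```

Because every page has the same size, a page number maps directly to a byte offset, which is what lets a storage engine move or cache pages without changing how callers address them.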
karx
October 19th, 2008, 10:37 PM
Thanks for the information, that was refreshing. By the way, the system will run on Linux clusters, with each machine supposed to handle at least 50 million documents, so the URLs to be indexed will be divided among the machines. To start, we will use 3 Linux machines: 2 for indexing/querying and 1 to compile results and queries. I might merge the attribute files into a single file.
*About Windows 7 (I never said it was a mistake; I meant they made mistakes in previous versions of Windows): why would they recode it from scratch? From what I heard, Windows 7 is not compatible with previous versions of Windows, meaning old programs will run on a special emulator that is included. And why did Linus make a new Unix flavour for x86? I guess that's the thing that puts me among the inventors :)
*Pardon my bad English, it's not my first language.
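The post does not say how the URLs would be divided among the index machines. One common approach, shown here purely as an assumption, is to hash each URL and take the result modulo the number of machines. The function names are hypothetical.

```python
import hashlib

def shard_for_url(url, shard_count):
    """Map a URL to an index machine using a hash that is stable across runs."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

def partition_urls(urls, shard_count):
    """Group URLs into one list per index machine."""
    shards = [[] for _ in range(shard_count)]
    for url in urls:
        shards[shard_for_url(url, shard_count)].append(url)
    return shards
```

A cryptographic hash is used here only because it gives the same result on every machine and every run. Python's built-in `hash()` is randomized per process for strings, so it would not be suitable.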
TheCPUWizard
October 19th, 2008, 10:49 PM
1) Your English is fine. SUGGESTION: Properly fill out your profile, and enable private messaging. This will greatly help with any "language" issues.
2) Given the latest information, I would still give VERY SERIOUS thought to having multiple indexing machines which do the parsing and submit bulk updates to a centralized (off-the-shelf) RDBMS. It will avoid ALL of the issues about concurrency and such; then you have one or more machines that are used to retrieve the information. I have set up similar systems and it really works well.
3) To make use of new features, you really do need to rework the core of an OS. Windows 7 (based on public information) will be using a virtual environment [NOT an EMULATOR] for backwards compatibility. This is very similar to how 32-bit applications are run on 64-bit operating systems.
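As a rough illustration of the bulk-update idea in point 2, the sketch below inserts a parsed batch in a single transaction. It uses Python's built-in `sqlite3` module only as a stand-in for a centralized RDBMS; the `postings` table and the `bulk_submit` helper are hypothetical.

```python
import sqlite3

def bulk_submit(conn, postings):
    """Insert a parsed batch in one transaction instead of row by row."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO postings (term, doc_id) VALUES (?, ?)", postings
        )

# In-memory database standing in for the central server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER)")
conn.execute("CREATE INDEX idx_term ON postings (term)")
bulk_submit(conn, [("search", 1), ("engine", 1), ("search", 2)])
```

Each indexing machine would send batches like this, and the database, rather than custom code, takes care of concurrent writers and of keeping the term index consistent.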
karx
October 19th, 2008, 10:53 PM
There will be no incremental updates; indexing will run once every month, at which point the old index will be completely removed. My focus is more on search performance than on database management. If you are interested, I could send you my draft index schema. It is not the latest, but you will get the idea and maybe can provide some insight on it.
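For a rebuild-and-replace cycle like this, one possible approach (an assumption, not something stated in the thread) is to build the new index in a temporary file and swap it in with a rename, so that queries never see a half-written index. `publish_index` is a hypothetical helper.

```python
import os
import tempfile

def publish_index(build, live_path):
    """Build into a temp file, then swap it in with a single rename."""
    directory = os.path.dirname(os.path.abspath(live_path))
    # Same directory as the live file, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            build(f)                 # caller writes the complete new index
            f.flush()
            os.fsync(f.fileno())     # make sure the data is on disk first
        os.replace(tmp_path, live_path)  # readers see the old or new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

On POSIX systems such as Linux the rename is atomic when both paths are on the same filesystem, and processes that already have the old file open keep reading it until they reopen.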
TheCPUWizard
October 19th, 2008, 11:09 PM
See part #1 of my previous reply....... :rolleyes:
codeguru.com