Introduction to Lucene.Net

Written By

CodeGuru Staff

Jan 18, 2012

What is Lucene.Net?

Lucene.Net is a port of the Lucene search engine library, which was
originally written in Java, to C#. It provides a framework (APIs) for
building applications with full-text search.

Lucene.Net can be downloaded from http://incubator.apache.org/lucene.net/download.html.
It is currently undergoing incubation at the Apache Software Foundation (ASF).

Why Use Lucene.Net?

You can use Lucene.Net to add more power to an existing search in
your ASP.Net web application or website. It can also be used to index and
search documents (Word, PDF, etc.) within your application, once their text
has been extracted.

This article describes how we can use Lucene.Net to add full-text
search to our ASP.Net applications. Any search function consists of two
basic steps: first index the text, then search it. We will use Lucene.Net
for both steps.

In this example we will read the content of a text file and
index it using Lucene.Net. First download the DLL and add a reference to it
in the project.

How to Use Lucene.Net

Indexing the text

There are a few things to understand before we start indexing.

1. Analyzer
– Reads the text and breaks it into words (tokens). It can also remove
‘noise words’ (common words which you would not want to index).

2. Fields
– Content holders with a name and a value.

3. Documents
– The unit of indexing and search. A document is a collection of fields.
Documents are added to the index and are returned as a list of results.

4. Index
– A collection of documents.

5. IndexWriter
– Writes documents to the index files.
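
To make these terms concrete, here is a toy sketch in plain C# that does not use Lucene.Net at all. The class ToyIndex, its methods, and its short noise word list are hypothetical and only illustrate the idea of analyzing text into tokens and mapping each token to the documents that contain it. Lucene.Net's real analyzers and index format are far more elaborate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy illustration only: these types are hypothetical and NOT part of Lucene.Net.
public static class ToyIndex
{
    // "Noise words" that the toy analyzer drops (an arbitrary short list).
    private static readonly HashSet<string> NoiseWords =
        new HashSet<string> { "a", "an", "the", "is", "of", "to" };

    // Analyzer step: split the text into lowercase tokens and drop noise words.
    public static List<string> Analyze(string text)
    {
        return text
            .ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' },
                   StringSplitOptions.RemoveEmptyEntries)
            .Where(token => !NoiseWords.Contains(token))
            .ToList();
    }

    // Index step: map each token to the ids of the documents that contain it.
    // Here a "document" is just a string and its id is its position in the list.
    public static Dictionary<string, SortedSet<int>> Build(IList<string> documents)
    {
        var index = new Dictionary<string, SortedSet<int>>();
        for (int docId = 0; docId < documents.Count; docId++)
        {
            foreach (string token in Analyze(documents[docId]))
            {
                SortedSet<int> postings;
                if (!index.TryGetValue(token, out postings))
                {
                    postings = new SortedSet<int>();
                    index[token] = postings;
                }
                postings.Add(docId);
            }
        }
        return index;
    }
}
```

Looking up a word in the dictionary returned by Build gives the ids of the matching documents without rescanning the text, which is why the text is indexed once up front and searched afterwards.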

Code for creating the index file

// Requires: using Lucene.Net.Analysis; using Lucene.Net.Analysis.Standard; using Lucene.Net.Index;
string strIndexDir = @"D:\Index";
Lucene.Net.Store.Directory indexDir = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
//Version parameter is used for backward compatibility. Stop words can also be passed to avoid indexing certain words
Analyzer std = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
//Create an IndexWriter object; true creates a new index or overwrites an existing one
IndexWriter idxw = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
//read the whole text file into a single field named "text"
Lucene.Net.Documents.Field fldText = new Lucene.Net.Documents.Field("text", System.IO.File.ReadAllText(@"d:\test.txt"), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.ANALYZED, Lucene.Net.Documents.Field.TermVector.YES);
doc.Add(fldText);
//write the document to the index
idxw.AddDocument(doc);
//optimize and close the writer
idxw.Optimize();
idxw.Close();
Response.Write("Indexing Done");

The parameters passed while creating the Field are:

1. Lucene.Net.Documents.Field.Store.YES
– The field value is stored in the index and can be returned in search results.
Passing NO does not store the value, so it cannot be shown in the results.

2. Lucene.Net.Documents.Field.Index.ANALYZED
– The field is passed through the analyzer and can be searched. NO means it will
not be searchable. NOT_ANALYZED means the field can be searched, but the analyzer
is not used and the whole value is indexed as a single term.

3. Lucene.Net.Documents.Field.TermVector.YES
– Stores the list of terms in the field and the number of occurrences of each
for every document (search the web for "term vector" to learn more).
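
As an illustration of how these options combine, the following is a minimal sketch, assuming the Lucene.Net 2.9 API used in this article. The helper BuildDocument and the field name "path" are hypothetical. It stores the file path without analyzing it, and analyzes the file contents without storing them.

```csharp
using Lucene.Net.Documents;

public static class FieldOptionsSketch
{
    // Hypothetical helper: the field names "path" and "text" are illustrative choices.
    public static Document BuildDocument(string path, string contents)
    {
        Document doc = new Document();

        // Stored, so it can be shown in the results; NOT_ANALYZED, so the whole
        // path is indexed as a single term and only an exact match finds it.
        doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));

        // ANALYZED, so individual words are searchable; not stored, so
        // Get("text") returns null on documents loaded back from the index.
        doc.Add(new Field("text", contents, Field.Store.NO, Field.Index.ANALYZED));

        return doc;
    }
}
```

Not storing a large text field keeps the index smaller, at the cost of having to reload the original file (here located through the stored path) when the content has to be displayed.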

It is recommended to call IndexWriter.Optimize() once indexing
is complete. It merges the index segments so that searches run as fast as
possible.

The first part, indexing the text, is now complete. Next we will
search the index for the text entered in the textbox.

Search the text

We will build the search query using the QueryParser class.
There are more Query classes available in Lucene.Net, such as TermQuery,
RangeQuery, etc., which can be used for different requirements. To create a
search query we need to supply the Analyzer object and the name of the field
in the index to search in.
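
For comparison with QueryParser, here is a minimal sketch of building a TermQuery by hand, assuming the Lucene.Net 2.9 API used in this article. The class QuerySketch and the method ExactTerm are hypothetical names. Unlike QueryParser, TermQuery does not run the analyzer, so the word is lowercased here to match the tokens that StandardAnalyzer writes to the index.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class QuerySketch
{
    // Hypothetical helper: matches documents whose "text" field contains
    // exactly this single term. No analysis is applied to the input, so
    // it is lowercased manually to line up with StandardAnalyzer's tokens.
    public static Query ExactTerm(string word)
    {
        return new TermQuery(new Term("text", word.ToLowerInvariant()));
    }
}
```

A query built this way can be passed to the searcher in the same way as one returned by QueryParser.Parse.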

// Requires: using Lucene.Net.Analysis; using Lucene.Net.Analysis.Standard;
string strIndexDir = @"D:\Index";
Analyzer std = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", std);
//Search is the textbox that holds the text entered by the user
Lucene.Net.Search.Query qry = parser.Parse(Search.Text);

After creating the query object we will open the index in read-only
mode with an IndexReader and wrap it in an IndexSearcher.

//Provide the directory where the index is stored
Lucene.Net.Store.Directory directory = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
//true opens the index in read-only mode
Lucene.Net.Search.Searcher srchr = new Lucene.Net.Search.IndexSearcher(Lucene.Net.Index.IndexReader.Open(directory, true));

Lucene.Net gathers the search results (documents) in
Collectors. There are different Collectors available in Lucene.Net. In this
example we will use “TopScoreDocCollector,” which keeps the highest-scoring
documents and returns them sorted by relevance score. The create method of
“TopScoreDocCollector” accepts two parameters: the maximum number of documents
required (int) and a flag stating whether the documents will be scored in
increasing document id order (bool).

TopScoreDocCollector cllctr = TopScoreDocCollector.create(100, true);

Once the collector object is ready we will perform the
search and get the results from the collector in a ScoreDoc array.

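//run the query; the matching documents are gathered in the collector
srchr.Search(qry, cllctr);
ScoreDoc[] hits = cllctr.TopDocs().scoreDocs;
for (int i = 0; i < hits.Length; i++)
{
    int docId = hits[i].doc;
    float score = hits[i].score;
    //load the stored fields of the matching document
    Lucene.Net.Documents.Document doc = srchr.Doc(docId);
    Response.Write("Searched from Text: " + doc.Get("text"));
}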
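
The indexing and search steps can also be tried end to end without touching the disk. The following is a minimal sketch, assuming the same Lucene.Net 2.9 API used in this article. The class InMemorySearchSketch and the method CountMatches are hypothetical names, and RAMDirectory replaces FSDirectory so that the index lives only in memory.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class InMemorySearchSketch
{
    // Hypothetical helper: indexes each string as one document in an
    // in-memory index, runs the query, and returns the number of hits (max 100).
    public static int CountMatches(string[] texts, string queryText)
    {
        Directory dir = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

        // Step 1: index the text.
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                                             IndexWriter.MaxFieldLength.UNLIMITED);
        foreach (string text in texts)
        {
            Document doc = new Document();
            doc.Add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
        writer.Close();

        // Step 2: search the text with the same analyzer and field name.
        QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29,
                                             "text", analyzer);
        Query query = parser.Parse(queryText);

        IndexSearcher searcher = new IndexSearcher(dir, true); // true = read-only
        TopScoreDocCollector collector = TopScoreDocCollector.create(100, true);
        searcher.Search(query, collector);
        int count = collector.TopDocs().scoreDocs.Length;
        searcher.Close();
        return count;
    }
}
```

It could be called as InMemorySearchSketch.CountMatches(new[] { "full text search", "something else" }, "search"). Keeping the index in memory is convenient for experiments and tests, while FSDirectory remains the choice when the index has to survive between requests.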

This is just an introduction to Lucene.Net. There are a lot
of other areas to be explored, such as different Analyzers, QueryParsers,
Collectors, etc.

Happy learning.