Wednesday, November 3, 2010

Lucene Search in Liferay


What is Lucene?

Lucene powered by Akamai EdgeComputing is based on Jakarta Lucene, an open-source search application created by the Apache Software Foundation.

Jakarta Lucene (http://jakarta.apache.org/lucene) is a high-performance, full-featured text search engine library written entirely in Java. The engine uses accurate and efficient search algorithms. Lucene includes the following functionality:

  • Provides the ability to search in many languages
  • Indexes any text-based file, such as HTML, or any file that can be converted to text
  • Supports ranked searching so the best results are returned first
  • Performs boolean and phrase queries
  • Enables fielded searching (e.g. searches can focus on title, author, contents, etc.)
  • Allows for date-range searching so users can access time-sensitive information

How Lucene Works

Creating an index is a necessary step in implementing a search application with Lucene. An index is a special database that contains a compiled version of the website content. The Lucene indexing API automates the creation of the index, but the content to be indexed must be in text format. For every document type you want to include in an index, you need a parser or extractor that converts it to text. Lucene includes a sample HTML parser that receives a URL or the location of a file on a hard drive, parses the file, extracts the text from the HTML tags, and creates a Java String object that is passed to the Lucene indexing API.
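To give a feel for the "extract text from HTML" step, here is a rough sketch in plain Java. It is not the sample parser that ships with Lucene; it uses a naive regular-expression approach instead, and the class and method names (NaiveHtmlText, extractText) are made up for this post. A real parser handles far more cases (malformed markup, all entities, encodings).

```java
// Naive HTML-to-text extraction using regular expressions.
// Illustrative only: NOT Lucene's sample HTML parser; names are hypothetical.
final class NaiveHtmlText {

    // Returns the visible text of an HTML string as one whitespace-normalized line.
    static String extractText(String html) {
        // Drop script and style elements together with their contents.
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        // Replace every remaining tag with a space so adjacent words stay separated.
        String noTags = noScripts.replaceAll("(?s)<[^>]*>", " ");
        // Decode a handful of common entities; "&amp;" goes last to avoid double decoding.
        String decoded = noTags
            .replace("&nbsp;", " ")
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&amp;", "&");
        // Collapse runs of whitespace and trim the ends.
        return decoded.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Akamai Home Page</title></head>"
            + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The resulting String is what you would hand to the indexing step as the content of a field.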

The first step in the indexing process is to identify the content that should be indexed. For each piece of content, Lucene works with a "Document object", which is a collection of name-value pairs called "fields". For example, one field might be Title, so the name-value pair would be "Title" - "Akamai Home Page". Each field is represented by a "field object", which determines whether the text associated with the field should be indexed, stored, and/or tokenized.
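The idea of documents made of fields, and an index built from them, can be sketched with a toy model in plain Java. This is not Lucene's API: ToyIndex and its methods are invented for illustration, every field is simply tokenized, indexed, and stored, and a real index is far more sophisticated.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of "documents made of fields" plus an inverted index.
// Illustrative only: NOT Lucene's API; all names are hypothetical.
final class ToyIndex {

    // Stored documents: each one maps a field name to its value.
    private final List<Map<String, String>> documents = new ArrayList<>();

    // Inverted index: "field:term" -> ids of the documents containing that term.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Stores the document and indexes every field, tokenized on non-alphanumerics.
    int add(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(new LinkedHashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String[] terms = field.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> ids = postings.computeIfAbsent(
                    field.getKey() + ":" + term, key -> new ArrayList<>());
                // Avoid listing the same document twice for a repeated term.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                    ids.add(docId);
                }
            }
        }
        return docId;
    }

    // Returns the ids of documents whose given field contains the term.
    List<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, Collections.<Integer>emptyList());
    }

    // Returns the stored value of a field, or null if the document lacks it.
    String stored(int docId, String field) {
        return documents.get(docId).get(field);
    }
}
```

Searching the Title field for "home" finds the document because the term was extracted at indexing time; the original value is still available through stored(...).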

Lucene Search in Liferay

Example - Bookmark Portlet

1. Implement a com.liferay.portal.kernel.search.Indexer class, like this one: http://lportal.svn.sourceforge.net/viewvc/lportal/portal/trunk/portal-impl/src/com/liferay/portlet/bookmarks/util/Indexer.java?view=markup

This class is responsible for adding, updating, and deleting documents in the index. It uses SearchEngineUtil, which is an abstraction over the underlying search engine (currently you can use Solr or Lucene; Lucene is the default engine).

When adding a document, you create a Document instance and pass it to SearchEngineUtil.addDocument along with the companyId. A com.liferay.portal.kernel.search.Document is just a collection of Fields. A com.liferay.portal.kernel.search.Field has a name and a value, which is the content that you want to search for. For example, if a Bookmark entry has the name "This is a Bruno bookmark entry.", you can later search for it with this query: name:"bruno".

A call to document.addText adds a Field to the document, and its value is filtered before it is indexed. Words that do not help produce relevant search results (such as punctuation, pronouns, and adverbs) are removed; in the example above, the bookmark entry name would become "bruno bookmark entry". addText is usually called when the content to be indexed is long text.

A call to document.addKeyword will add a Field to the document and its value will be indexed with no modification.
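The difference between the two calls can be imitated in plain Java. The sketch below is not Liferay or Lucene code: TextVersusKeyword is a made-up class, and the stop-word list is a tiny assumed sample rather than the list the real analyzer uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Imitates the effect of addText versus addKeyword on a field value.
// Illustrative only: names and the stop-word list are assumptions.
final class TextVersusKeyword {

    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "this", "is", "of", "to"));

    // addText-like behaviour: lower-case, split on punctuation and spaces, drop stop words.
    static String analyzeText(String value) {
        List<String> kept = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    // addKeyword-like behaviour: the value is kept exactly as given.
    static String analyzeKeyword(String value) {
        return value;
    }

    public static void main(String[] args) {
        String name = "This is a Bruno bookmark entry.";
        System.out.println(analyzeText(name));    // bruno bookmark entry
        System.out.println(analyzeKeyword(name)); // This is a Bruno bookmark entry.
    }
}
```

This is why identifiers such as ids are usually added as keywords: they must match exactly and should not be split or filtered.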

document.addUID is very important because it creates a unique identifier for the document in the index. You will need it to update or delete the document later. There are some other useful methods on Document, such as addFile, which extracts content from many file formats, and addModifiedDate, which adds the current date as a field to the document.
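To see why a stable unique id matters, here is a small plain-Java sketch of an index keyed by UID. It is not Liferay code: UidKeyedIndex is hypothetical, and the "portletId_PORTLET_entryId" layout of the id is only an assumption made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Shows how a unique id lets you update or delete one specific document.
// Illustrative only: class name and uid layout are assumptions.
final class UidKeyedIndex {

    private final Map<String, String> contentByUid = new LinkedHashMap<>();

    // Builds an id that is unique per portlet and entry (assumed layout).
    static String uid(String portletId, long entryId) {
        return portletId + "_PORTLET_" + entryId;
    }

    // Adding with an existing uid replaces the old document, which is an update.
    void addOrUpdate(String uid, String content) {
        contentByUid.put(uid, content);
    }

    // Returns true if a document with that uid existed and was removed.
    boolean delete(String uid) {
        return contentByUid.remove(uid) != null;
    }

    String get(String uid) {
        return contentByUid.get(uid);
    }

    int size() {
        return contentByUid.size();
    }
}
```

Without such an id, an update would add a second copy of the entry to the index instead of replacing the first one.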

Some fields are commonly used: groupId, portletId, and companyId. They are useful because they let you narrow your searches to documents that were created in a certain community or company, or even in a specific portlet instance.

Indexer.getDocumentSummary is used by the Search portlet, which aggregates the results from all portlets in one place. The returned DocumentSummary is used to render each document in the search result listing.

2. Add an <indexer-class> element to liferay-portlet.xml pointing to this class. Whenever you click the Re-index button, the Indexer's re-index method will be called.

3. Whenever a Bookmark entry is added to, updated in, or deleted from the database, the Indexer is called to update the index accordingly. Search for Indexer at: http://lportal.svn.sourceforge.net/viewvc/lportal/portal/trunk/portal-impl/src/com/liferay/portlet/bookmarks/service/impl/BookmarksEntryLocalServiceImpl.java?view=markup

Observe the places where Indexer is called:

Indexer.addEntry(entry.getCompanyId(), folder.getGroupId(), folderId, entryId, name, url, comments, tagsEntries);

Indexer.deleteEntry(entry.getCompanyId(), entry.getEntryId());

Indexer.updateEntry(entry.getCompanyId(), folder.getGroupId(), entry.getFolderId(), entry.getEntryId(), name, url, comments, tagsEntries);

If you encounter a NullPointerException while retrieving your portlet's indexer (portlet.getIndexerInstance() returning null), check the portletId returned by YourIndexer.getPortletId(). Keep in mind that for a portlet deployed outside the portal, the portletId will be "<portlet-name in portlet.xml>_WAR_<webapp name>".
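As a quick sanity check, you can build the expected id from the two names and compare it with what your indexer returns. The helper below is a hypothetical sketch (PortletIds is not a Liferay class), and it only applies the pattern described above; the portal may also normalise characters in these names, which this sketch does not attempt.

```java
// Builds the portletId pattern described above for a portlet deployed as its own webapp.
// Illustrative only: PortletIds is a made-up helper, not part of Liferay.
final class PortletIds {

    // Joins the portlet name and webapp name as "<portlet-name>_WAR_<webapp name>".
    static String warPortletId(String portletName, String webappName) {
        return portletName + "_WAR_" + webappName;
    }

    // Returns true when the indexer's id matches the expected pattern.
    static boolean matches(String indexerPortletId, String portletName, String webappName) {
        return warPortletId(portletName, webappName).equals(indexerPortletId);
    }

    public static void main(String[] args) {
        // Hypothetical names, chosen only for the example.
        System.out.println(warPortletId("mybookmarks", "sampleportlet"));
    }
}
```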

4. Now that Bookmark entries can be added to, updated in, and removed from the index, you can make search requests to SearchEngineUtil. Take a look at: http://lportal.svn.sourceforge.net/viewvc/lportal/portal/trunk/portal-impl/src/com/liferay/portlet/bookmarks/service/impl/BookmarksFolderLocalServiceImpl.java?view=markup

Look at the search method: public Hits search(long companyId, long groupId, long[] folderIds, String keywords, int start, int end)

It constructs a com.liferay.portal.kernel.search.Query instance and calls SearchEngineUtil.search(companyId, fullQuery, start, end) which returns a Hits instance.

A BooleanQuery is one implementation of Query. You can specify which field values MUST (AND operator), SHOULD (OR operator), or MUST_NOT (AND NOT operator) occur in the results. You can create composite BooleanQueries by adding one to another.
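The three occurrence rules can be spelled out with a toy evaluator in plain Java. This is not Liferay's BooleanQuery: ToyBooleanQuery is invented for illustration, it checks one document at a time, and it compares whole field values instead of analyzed terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Applies MUST / SHOULD / MUST_NOT rules to a single document.
// Illustrative only: NOT Liferay's BooleanQuery; all names are hypothetical.
final class ToyBooleanQuery {

    enum Occur { MUST, SHOULD, MUST_NOT }

    private static final class Clause {
        final String field;
        final String value;
        final Occur occur;

        Clause(String field, String value, Occur occur) {
            this.field = field;
            this.value = value;
            this.occur = occur;
        }
    }

    private final List<Clause> clauses = new ArrayList<>();

    ToyBooleanQuery add(String field, String value, Occur occur) {
        clauses.add(new Clause(field, value, occur));
        return this;
    }

    // A document matches when every MUST clause holds and no MUST_NOT clause holds.
    // Without any MUST clause, at least one SHOULD clause has to hold, so a query
    // made only of MUST_NOT clauses matches nothing.
    boolean matches(Map<String, String> document) {
        boolean hasMust = false;
        boolean anyShould = false;
        for (Clause clause : clauses) {
            boolean hit = clause.value.equalsIgnoreCase(document.get(clause.field));
            switch (clause.occur) {
                case MUST:
                    hasMust = true;
                    if (!hit) {
                        return false;
                    }
                    break;
                case MUST_NOT:
                    if (hit) {
                        return false;
                    }
                    break;
                default: // SHOULD
                    anyShould |= hit;
                    break;
            }
        }
        return hasMust || anyShould;
    }
}
```

For example, a query with MUST groupId = 10 and MUST_NOT folderId = 5 keeps every document of group 10 except those in folder 5.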

The start and end parameters are used to paginate the results. For example, if start == 0 and end == 2, only the first two entries appear in the result; if both are equal to QueryUtil.ALL_POS (-1), then all results are returned.
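The start/end convention (start inclusive, end exclusive, -1 for everything) can be shown on an ordinary list. ResultPager below is a hypothetical helper written for this post; it only mirrors the convention and is not how the search engine slices results internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Mirrors the start/end pagination convention on a plain list.
// Illustrative only: ResultPager is a made-up helper.
final class ResultPager {

    // Same value as QueryUtil.ALL_POS mentioned in the text.
    static final int ALL_POS = -1;

    // Returns results[start, end), clamped to the list size, or everything for ALL_POS.
    static <T> List<T> page(List<T> results, int start, int end) {
        if (start == ALL_POS && end == ALL_POS) {
            return new ArrayList<>(results);
        }
        int from = Math.max(0, Math.min(start, results.size()));
        int to = Math.max(from, Math.min(end, results.size()));
        return new ArrayList<>(results.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("a", "b", "c");
        System.out.println(page(hits, 0, 2));             // [a, b]
        System.out.println(page(hits, ALL_POS, ALL_POS)); // [a, b, c]
    }
}
```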

5. Now that you have the search method, you can use it in a JSP, for example: http://lportal.svn.sourceforge.net/viewvc/lportal/portal/trunk/portal-web/docroot/html/portlet/bookmarks/search.jsp?view=markup

This JSP calls the search method and iterates over com.liferay.portal.kernel.search.Hits, which is a collection of Documents and represents the result of the search. The Hits class gives the score (the importance) of each document, and tells you how long the search took and how many documents were found.

On each Document, you can get the value of a field by calling document.get(String fieldName), or getValues if the Field has multiple stored values.

6. If you want your portlet to be listed in the Search portlet results, one more step is needed: create a com.liferay.portal.kernel.search.OpenSearch implementation and add an <open-search-class> element to your liferay-portlet.xml. Take a look at how this is implemented for the Bookmark portlet:

http://lportal.svn.sourceforge.net/viewvc/lportal/portal/trunk/portal-impl/src/com/liferay/portlet/bookmarks/util/BookmarksOpenSearchImpl.java?view=markup

 



 Sibi Thomas

4 comments:

Anonymous said...

i want to know more how to extend this search by adding more filters like search keywords by any words, exact phrase & all words.

thanks, hope to hear from u soon

Anonymous said...

Hi , i use lucene search in liferay but colon : character is not accepted. when i search for example worm:data , i found 0 zero result. how to overcome this?could u pls help me.

Anonymous said...

Hi , i m new to liferay..i want to create search portlet using lucene..can any one suggest me the steps to do..

thanks..

Anonymous said...

Really nice explanation, so simple, like it.
