Shakespeare's Seven Monkeys

12 December 2012

An Initial Adventure with Lucene.

This is the first exercise in a tutorial series introducing Lucene, the text search engine library. Source for the exercises in this series is available on GitHub, and the only prerequisite for running the initial exercises is Groovy 2.0. The texts indexed in the exercises come from Project Gutenberg.

This exercise walks through a Groovy script to show how simple it is to index a document and then search it: the script indexes ‘The complete works of Shakespeare’ and performs a single-term search against it. First we shall use a Groovy feature called @Grab, which adds the required Lucene dependencies to the script’s classpath.

// @Grab is a nice feature for getting dependencies added to the classpath.
@Grab(group='org.apache.lucene', module='lucene-core', version='4.0.0')
@Grab(group='org.apache.lucene', module='lucene-queryparser', version='4.0.0')
@Grab(group='org.apache.lucene', module='lucene-analyzers-common', version='4.0.0')
@Grab(group='org.apache.lucene', module='lucene-queries', version='4.0.0')

import org.apache.lucene.analysis.*
import org.apache.lucene.analysis.standard.*
import org.apache.lucene.document.*
import org.apache.lucene.index.*
import org.apache.lucene.queryparser.flexible.standard.*
import org.apache.lucene.search.*
import org.apache.lucene.store.*
import org.apache.lucene.util.*

Next we shall create an IndexWriter using a standard analyser and a RAMDirectory where the index will be stored for the duration of the script.

// This script indexes the text from shakespeare.txt line by line.

// Search for a line containing the first argument if passed,
// otherwise search for lines with monkey.
def searchTerm = this.args.length > 0 ? "line:${this.args[0]}" : "line:monkey"

// Setup required lucene objects for writing to the lucene index.
def indexDirectory = new RAMDirectory();
def analyzer = new StandardAnalyzer(Version.LUCENE_40)
def writerConfiguration = new IndexWriterConfig(Version.LUCENE_40, analyzer)
def indexWriter = new IndexWriter(indexDirectory, writerConfiguration);
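
The analyser decides which terms actually end up in the index. StandardAnalyzer splits text into words, drops punctuation, lower-cases each term and removes common English stop words, which is why a capitalised search term can still match lower-case text. The sketch below prints the terms produced for one line; the helper name analyse is ours for illustration and is not part of Lucene.

```groovy
@Grab(group='org.apache.lucene', module='lucene-core', version='4.0.0')
@Grab(group='org.apache.lucene', module='lucene-analyzers-common', version='4.0.0')

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version

// Return the terms the analyser produces for one piece of text.
// "analyse" is a hypothetical helper name, not a Lucene method.
def analyse(analyzer, String text) {
    def tokens = []
    def stream = analyzer.tokenStream("line", new StringReader(text))
    def term = stream.addAttribute(CharTermAttribute)
    stream.reset()                    // required before the first incrementToken()
    while (stream.incrementToken()) {
        tokens << term.toString()
    }
    stream.end()
    stream.close()
    return tokens
}

def analyzer = new StandardAnalyzer(Version.LUCENE_40)
def tokens = analyse(analyzer, "Now, God help thee, poor Monkey!")
println tokens
```

The same analyser instance is handed to the query parser later in the script, so search terms go through the same treatment as the indexed text.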

We shall then use a closure to add each line of ‘The complete works of Shakespeare’ to the index as its own document, along with its line number.

// Index the shakespeare text file line by line.
new File("shakespeare.txt").readLines().eachWithIndex { line, lineNumber ->
	Document doc = new Document();
	doc.add(new IntField("lineNumber", lineNumber, Field.Store.YES))
	doc.add(new TextField("line", line, Field.Store.YES))
	indexWriter.addDocument(doc)
}
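
One detail worth knowing: eachWithIndex counts from 0, so the line numbers stored in the index are zero-based. If you would rather have the one-based numbers a text editor shows, Groovy's eachLine with a two-argument closure counts from 1. A minimal sketch, using a small temporary file as a stand-in for shakespeare.txt (the file contents and variable names here are made up for illustration):

```groovy
// Hypothetical sample text standing in for shakespeare.txt.
def sample = File.createTempFile("sample", ".txt")
sample.deleteOnExit()
sample.text = "first line\nsecond line\nthird line\n"

// eachWithIndex counts from 0, as in the indexing loop above.
def zeroBased = []
sample.readLines().eachWithIndex { line, index -> zeroBased << index }

// eachLine with a two-argument closure counts from 1, and reads the
// file line by line instead of loading it into a list first.
def oneBased = []
sample.eachLine { line, number -> oneBased << number }

println "eachWithIndex: ${zeroBased}"
println "eachLine:      ${oneBased}"
```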

With the indexing of ‘The complete works of Shakespeare’ finished, it is now time to search for lines which contain a term.

// Print out each line which matches the search term, with a return limit of 10000 matches.
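// Open the reader through the public API; IndexWriter.getReader() is not part of
// the public Lucene 4.0 API and is only reachable because Groovy ignores visibility.
def indexReader = DirectoryReader.open(indexWriter, true)
def query = new StandardQueryParser(analyzer).parse(searchTerm, "")
def indexSearcher = new IndexSearcher(indexReader)
def hits = indexSearcher.search(query, 10000).scoreDocs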

hits.collect{indexSearcher.doc(it.doc)}.each{ println "${it.lineNumber} ${it.line}"}
println "${hits.length} matches for ${searchTerm - 'line:'} found."

// Tidy up resources
indexReader.close()
indexWriter.close()
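
Hits come back ordered by relevance score rather than by position in the text. If you would prefer them in reading order, one option is to sort the retrieved documents by their stored line number before printing. The sketch below is a cut-down version of the script under that assumption; the three sample lines stand in for shakespeare.txt and the variable names are ours.

```groovy
@Grab(group='org.apache.lucene', module='lucene-core', version='4.0.0')
@Grab(group='org.apache.lucene', module='lucene-queryparser', version='4.0.0')
@Grab(group='org.apache.lucene', module='lucene-analyzers-common', version='4.0.0')
@Grab(group='org.apache.lucene', module='lucene-queries', version='4.0.0')

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.*
import org.apache.lucene.index.*
import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

// Hypothetical sample lines standing in for shakespeare.txt.
def sampleLines = [
    "CALIBAN. Thou liest, thou jesting monkey, thou;",
    "A line without the search term.",
    "Into baboon and monkey."
]

// Index each sample line exactly as the main script does.
def analyzer = new StandardAnalyzer(Version.LUCENE_40)
def indexWriter = new IndexWriter(new RAMDirectory(),
        new IndexWriterConfig(Version.LUCENE_40, analyzer))
sampleLines.eachWithIndex { line, lineNumber ->
    def doc = new Document()
    doc.add(new IntField("lineNumber", lineNumber, Field.Store.YES))
    doc.add(new TextField("line", line, Field.Store.YES))
    indexWriter.addDocument(doc)
}

def indexReader = DirectoryReader.open(indexWriter, true)
def indexSearcher = new IndexSearcher(indexReader)
def query = new StandardQueryParser(analyzer).parse("line:monkey", "")
def hits = indexSearcher.search(query, 10).scoreDocs

// Stored values come back as strings, so convert before sorting numerically.
def documents = hits.collect { indexSearcher.doc(it.doc) }
def matches = documents.sort { it.get("lineNumber").toInteger() }
matches.each { println "${it.get('lineNumber')} ${it.get('line')}" }

// Tidy up resources
indexReader.close()
indexWriter.close()
```

Sorting in Groovy after the search keeps the sketch close to the original script; it is fine for a few thousand hits, though it does nothing to change which hits make it under the return limit.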

Finally it is time to run the script and see which lines match our term.

$ groovy IndexAndSearchShakespeare.groovy Monkey
74109 for a monkey.
105489 Into baboon and monkey.
79411 On meddling monkey, or on busy ape,
33770 was the very genius of famine; yet lecherous as a monkey, and the
104044 CALIBAN. Thou liest, thou jesting monkey, thou;
12445 an ape, more giddy in my desires than a monkey. I will weep for
68612 LADY MACDUFF. Now, God help thee, poor monkey! But how wilt thou do
7 matches for Monkey found.

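Try using wildcards, as long as they are not the first character of the term (the Lucene Query Syntax does not allow a * or ? at the start of a search, and the query parser rejects it by default). For example: groovy IndexAndSearchShakespeare.groovy 'Monk*'. The quotes stop the shell from expanding the * itself.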
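
To see the wildcard rule in action without touching the index, the parser can be exercised on its own. This is a small sketch rather than part of the exercise script: it parses a trailing wildcard, shows a leading wildcard being rejected, and then opts in through setAllowLeadingWildcard. The variable names are ours.

```groovy
@Grab(group='org.apache.lucene', module='lucene-core', version='4.0.0')
@Grab(group='org.apache.lucene', module='lucene-queryparser', version='4.0.0')
@Grab(group='org.apache.lucene', module='lucene-analyzers-common', version='4.0.0')
@Grab(group='org.apache.lucene', module='lucene-queries', version='4.0.0')

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.queryparser.flexible.core.QueryNodeException
import org.apache.lucene.queryparser.flexible.standard.StandardQueryParser
import org.apache.lucene.util.Version

def parser = new StandardQueryParser(new StandardAnalyzer(Version.LUCENE_40))

// A trailing wildcard parses without complaint.
def trailing = parser.parse("line:monk*", "")
println "Parsed: ${trailing}"

// A leading wildcard is rejected with the parser's default settings.
def rejected = false
try {
    parser.parse("line:*onkey", "")
} catch (QueryNodeException e) {
    rejected = true
    println "Rejected: ${e.message}"
}

// Opting in: the same query parses once leading wildcards are allowed.
parser.setAllowLeadingWildcard(true)
def leading = parser.parse("line:*onkey", "")
println "Parsed after opting in: ${leading}"
```

Leading wildcards are off by default for a reason: matching them means scanning far more of the term dictionary, which can be slow on a large index.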

See you next time...

Article By
David McFarland

Principal Engineer