
# Intellij config
.idea/
target/
*.iml
# Eclipse config
bin/
.project
.classpath
.settings/
.vscode
Dockerfile
docker-compose.yml
Luwak 1.5.0
===========
This release of luwak requires Java 8
API changes:
* Upgrade to lucene 6.5
* Monitor.update() is declared as throwing an UpdateException (#106)
Optimizations:
* Disable bulk scoring, reducing fixed overhead for large boolean disjunctions
* Query decomposer will recurse into DisjunctionMaxQuery (#112)
New features:
* HighlightingMatcher can handle exact phrase queries
* HighlightingMatcher reports matches even if it can't actually highlight a
query
* Several code cleanups contributed by Bloomberg (#100, #102, #107, #118, #119)
* Expose hits map in HighlightsMatch (#92)
* All lucene core queries are now handled by TermFilteredPresearcher
* SpanRewriter can handle ConstantScoreQuery (#114)
* SpanRewriter will attempt to rewrite unknown queries. This allows luwak to
highlight queries produced by, for example, the ComplexPhraseQueryParser
(#136)
Bug fixes:
* MonitorQuery metadata checks for null values
* InputDocument propagates position and offset gaps for multivalued fields
(#117)
* SpanRewriter.rewriteBoolean() propagates minShouldMatch (#123)
* HighlightingMatcher was only returning hits for the last doc in a batch (#134)
* Unpositioned span terms were being incorrectly highlighted due to two-phase
iteration (fixed by LUCENE-7628)
* InputDocument was not respecting omitNorms values on fields (fixed by
LUCENE-7696) (#135)
Luwak 1.4.0
===========
API changes:
* Upgrade to lucene 5.4.1
Optimizations:
* Refactored the internal query index code into its own class, for easier
testing
* Avoid re-analyzing queries that have already been added
New features:
* Added better toString() representations of QueryMatch objects
* Several code cleanups contributed by DevFactory
Bug fixes:
* Several fixes to the HighlightingMatcher span-rewrite code
* Luwak-specific queries correctly implement equals() and hashCode(), fixing
some query-cache interaction bugs.
* HighlightsMatch.Hit now implements hashCode()
* QueryTreeAnalyzer uses a LinkedHashSet to enforce deterministic iteration
order over its children
Luwak 1.3.0
===========
The major change here is the introduction of the batching API, allowing users to
match several documents at once. In general this will improve throughput at the
expense of latency, although as always you should measure timings carefully.
API changes:
* Added Monitor.match(DocumentBatch, MatcherFactory) and
Monitor.debug(DocumentBatch, MatcherFactory) in addition to the existing
InputDocument methods.
* QueryMatch now takes a document ID as well as a query ID
* Matches<T> exposes DocumentMatches<T> objects for each document in an input batch
* Presearcher.buildQuery() now takes a QueryTermFilter rather than an
IndexReaderContext for filtering out terms, and a LeafReader over a batch's
document index rather than an InputDocument
* Various queryindex configuration options (commit batch size, cache purge
frequency, whether or not to store queries, which QueryDecomposer to use) are
now encapsulated in a QueryIndexConfiguration object, passed to the Monitor
constructor.
* Monitor state updates can now be watched by registering
QueryIndexUpdateListeners. The beforeCommit() and afterCommit() methods are
removed.
* Monitor.getStats() is renamed to Monitor.getQueryCacheStats()
New features:
* You can now configure Similarity implementations on a DocumentBatch
* Query metadata is passed to CandidateMatcher.doMatchQuery()
* DisjunctionMaxQuery can be highlighted
* DisjunctionMaxQuery can be decomposed and indexed separately
* Performance improvements: use TermsQuery for the document disjunction, and
filter out non-existent query terms before they're added to the query
* ConcurrentQueryLoader allows more efficient query loading on startup
* WildcardNGramPresearcherComponent makes it easier to adjust its max token
length
* You can choose not to store MonitorQueries in the Monitor for some memory
savings, by setting QueryIndexConfiguration.storeQueries(false).
Bug fixes:
* The slowlog was reporting nanosecond values as milliseconds
* FieldFilterPresearcherComponent didn't work for binary terms
* FieldFilterPresearcherComponent was breaking debug()
Copyright (c) 2013 Lemur Consulting Ltd.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.
See the License for the specific language governing permissions and
limitations under the License.
indexer thread purge thread
clone tlog
add docs <-- dropped!
collect
commit
clear tlog
swap tlog
-----------------------------------------
* add docs + commit is guarded by a read lock
* purge thread creates purge log
* purge log creation guarded by write lock
* if purge log exists, indexer adds its queries to it
* purge thread adds purge log contents after collecting
add docs |_
add to purge log | <-- no purge log, so dropped!
-| create purge log
collect
commit
_| copy purge log
| swap cache
So I need to make adding docs to the cache and committing protected by the read lock.
* update _parses_ the docs, but doesn't add them to the cache
* commit() now takes a list of CacheEntry objects
add & parse docs
-| create purge log
collect
_| copy purge log
| swap cache
add to cache |
add to purge log |-
commit |
* problem now is: what happens if another thread commits before the cache is
populated? We can get NPEs from the cache
- silently drop? We don't guarantee that the doc is there until update()
returns, so I think this is OK.
add & parse docs
delete docs
commit
match in index
retrieve from cache -> NPE!
add to cache |
add to purge log |-
commit |
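A rough sketch of the locking scheme described above, in Java (illustrative only: the field,
class and method names are not the actual Luwak internals, and commit()/collectInto() are elided):

    // Assumed fields:
    //   ReadWriteLock lock = new ReentrantReadWriteLock();
    //   volatile Map<String, CacheEntry> cache = new ConcurrentHashMap<>();
    //   volatile Map<String, CacheEntry> purgeLog = null;   // non-null only while a purge is running

    // indexer thread: parse outside the lock, then cache + purge log + commit under the read lock
    void update(List<CacheEntry> parsed) throws IOException {
        lock.readLock().lock();
        try {
            for (CacheEntry entry : parsed) {
                cache.put(entry.queryId, entry);
                if (purgeLog != null) {
                    purgeLog.put(entry.queryId, entry);   // purge in progress: record for replay
                }
            }
            commit(parsed);                               // write to the Lucene query index
        } finally {
            lock.readLock().unlock();
        }
    }

    // purge thread: create the purge log under the write lock, collect outside it,
    // then replay the purge log and swap the cache under the write lock again
    void purge() throws IOException {
        Map<String, CacheEntry> newCache = new ConcurrentHashMap<>();
        lock.writeLock().lock();
        try {
            purgeLog = new ConcurrentHashMap<>();
        } finally {
            lock.writeLock().unlock();
        }

        collectInto(newCache);                            // walk the query index, copying live entries

        lock.writeLock().lock();
        try {
            newCache.putAll(purgeLog);                    // entries added while we were collecting
            cache = newCache;
            purgeLog = null;
        } finally {
            lock.writeLock().unlock();
        }
    }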
Luwak - stored query engine from Flax
=====================================
[![Build Status](https://travis-ci.org/flaxsearch/luwak.svg?branch=master)](https://travis-ci.org/flaxsearch/luwak)
What is Luwak?
--------------
Based on the open source Lucene search library, Luwak is a high performance stored query engine. Simply put, it allows you to define a set of search queries and then monitor a stream of documents for any that might match these queries: a function also known as 'reverse search' and 'document routing'. Flax developed Luwak for clients who monitor high volumes of news using often extremely complex Boolean expressions. Luwak is used by companies including Infomedia, [Bloomberg](http://www.flax.co.uk/blog/2016/03/08/helping-bloomberg-build-real-time-news-search-engine/) and Booz Allen Hamilton.
You can find out more about how Flax uses Luwak for media monitoring applications in [this video from Lucene Revolution 2013](http://www.youtube.com/watch?v=rmRCsrJp2A8), [this video from Berlin Buzzwords 2014](http://berlinbuzzwords.de/session/turning-search-upside-down-search-queries-documents), and [this post on how we combined it with Apache Samza](http://www.flax.co.uk/blog/2015/08/26/real-time-full-text-search-with-luwak-and-samza/), which includes a great illustration of how Luwak's internals work.
Here are [some tests we did to compare Luwak to Elasticsearch Percolator](http://www.flax.co.uk/blog/2015/07/27/a-performance-comparison-of-streamed-search-implementations/).
Scott Stults of Open Source Connections wrote ["How to use Luwak to run preset queries against incoming documents"](http://opensourceconnections.com/blog/2016/02/05/luwak), and Ryan Walker wrote [a complete streaming search implementation with Luwak](http://insightdataengineering.com/blog/streaming-search/).
Get the artifacts
-----------------
```xml
<dependency>
    <groupId>com.github.flaxsearch</groupId>
    <artifactId>luwak</artifactId>
    <version>1.4.0</version>
</dependency>
```
Using the monitor
-----------------
Basic usage looks like this:
```java
Monitor monitor = new Monitor(new LuceneQueryParser("field"), new TermFilteredPresearcher());
MonitorQuery mq = new MonitorQuery("query1", "field:text");
List<QueryError> errors = monitor.update(mq);
// match one document at a time
InputDocument doc = InputDocument.builder("doc1")
        .addField(textfield, document, new StandardAnalyzer())
        .build();
Matches<QueryMatch> matches = monitor.match(doc, SimpleMatcher.FACTORY);
// or match a batch of documents
Matches<QueryMatch> matches = monitor.match(DocumentBatch.of(listOfDocuments), SimpleMatcher.FACTORY);
```
Adding queries
--------------
The monitor is updated using MonitorQuery objects, which consist of an id, a query string, and an
optional metadata map. The monitor uses its provided MonitorQueryParser
to parse the query strings and cache query objects.
In Luwak 1.5.0, errors encountered while adding queries (during query parsing, for example) cause an
```UpdateException``` to be thrown, detailing which queries could not be added.
In Luwak 1.4 and below, ```Monitor.update()``` returns a list of ```QueryError``` objects, which should
be checked for parse errors. The list will be empty if all queries were added successfully.
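For example, adding a query with metadata and handling errors in either version might look like this (a sketch only: the three-argument ```MonitorQuery``` constructor and the metadata key are assumptions, and the two ```update()``` calls belong to different Luwak versions, so they would not appear together in real code):
```java
MonitorQuery mq = new MonitorQuery("query1", "field:text",
        Collections.singletonMap("customer", "customer-1"));

// Luwak 1.5.0 and later: update() throws on failure
try {
    monitor.update(mq);
} catch (UpdateException e) {
    System.err.println("Some queries could not be added: " + e.getMessage());
}

// Luwak 1.4 and below: update() returns a list of errors
List<QueryError> errors = monitor.update(mq);
for (QueryError error : errors) {
    System.err.println("Could not add query: " + error);
}
```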
Matching documents
------------------
Queries selected by the monitor to run against an InputDocument are passed to a CandidateMatcher
class. Four basic implementations are provided:
* SimpleMatcher - reports which queries matched the InputDocument
* ScoringMatcher - reports which queries matched, with their scores
* ExplainingMatcher - reports which queries matched, with an explanation for their scores
* HighlightingMatcher - reports which queries matched, with the individual matches for each query
In addition, luwak has two multithreaded matchers which wrap the simpler matchers:
* ParallelMatcher - runs queries in multiple threads as they are collected from the Monitor
* PartioningMatcher - collects queries, partitions them into groups, and then runs each group in its own thread
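For example, to get per-query highlights instead of simple matches, pass a different factory to ```match()```. This sketch reuses the ```InputDocument``` from the basic usage example above and assumes ```HighlightingMatcher``` exposes a ```FACTORY``` constant analogous to ```SimpleMatcher.FACTORY``` and that ```Matches``` can return the matches for a given document id:
```java
// Swap the matcher factory to change what is reported for each matching query.
Matches<HighlightsMatch> matches = monitor.match(doc, HighlightingMatcher.FACTORY);
for (HighlightsMatch match : matches.getMatches("doc1")) {
    System.out.println(match.getQueryId() + " hit the document");
}
```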
Running the demo
----------------
A small demo program is included in the distribution that will run queries provided
in a text file over a small corpus of documents from Project Gutenberg (via NLTK).
```sh
./run-demo
```
Filtering out queries
---------------------
The monitor uses a ```Presearcher``` implementation to reduce the number of queries it runs
during a ```match``` run. Luwak comes with three presearcher implementations.
### MatchAllPresearcher
This Presearcher does no filtering whatsoever, so the monitor will run all its registered
queries against every document passed to ```match```.
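For example (a sketch, assuming ```MatchAllPresearcher``` has a public no-argument constructor):
```java
// Every registered query is run against every document: simple, but slow for large query sets.
Monitor monitor = new Monitor(new LuceneQueryParser("field"), new MatchAllPresearcher());
```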
### TermFilteredPresearcher
This Presearcher extracts terms from each registered query and indexes the queries against them
in the Monitor's internal index. At match-time, the passed-in ```InputDocument``` is tokenized
and converted to a disjunction query. All queries that match this query in the monitor's index
are then run against the document.
### MultipassTermFilteredPresearcher
An extension of ```TermFilteredPresearcher``` that tries to improve filtering on phrase queries
by indexing different combinations of terms.
The TermFilteredPresearcher can be configured with different ```PresearcherComponent```
implementations - for example, you can ignore certain fields with a ```FieldFilterPresearcherComponent```,
or get accurate filtering on wildcard queries with a ```WildcardNGramPresearcherComponent```.
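For example, a presearcher that filters on the ```language``` field and handles wildcard queries might be configured like this (a sketch only: the component constructor signatures shown here are assumptions and may differ between releases):
```java
Presearcher presearcher = new TermFilteredPresearcher(
        new FieldFilterPresearcherComponent("language"),
        new WildcardNGramPresearcherComponent());
Monitor monitor = new Monitor(new LuceneQueryParser("field"), presearcher);
```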
Adding new query types
----------------------
```TermFilteredPresearcher``` extracts terms from queries by using a ```QueryAnalyzer``` to build
a tree representation of the query, and then selecting the best possible set of terms from that tree
that uniquely identify the query. The tree is built using a set of specialized ```QueryTreeBuilder```
implementations, one for each lucene ```Query``` class.
The default builders will not be appropriate for all custom Query types. You can create your own tree builder by
subclassing ```QueryTreeBuilder```, and then pass it to the ```TermFilteredPresearcher``` in
a ```PresearcherComponent```.
```java
public class CustomQueryTreeBuilder extends QueryTreeBuilder<CustomQuery> {

    public CustomQueryTreeBuilder() {
        super(CustomQuery.class);
    }

    @Override
    public QueryTree buildTree(QueryAnalyzer builder, CustomQuery query) {
        return new TermNode(getYourTermFromCustomQuery(query));
    }

}
...
Presearcher presearcher = new TermFilteredPresearcher(new PresearcherComponent(new CustomQueryTreeBuilder()));
```
Customizing the existing presearchers
-------------------------------------
Not all terms extracted from a query need to be indexed, and the fewer terms indexed, the
more performant the presearcher filter will be. For example, a BooleanQuery with many SHOULD
clauses but only a single MUST clause only needs to index the terms extracted from the MUST
clause. Terms in a parsed query tree are given weights and the ```QueryAnalyzer``` uses these
weights to decide which terms to extract and index. The weighting is done by a ```TreeWeightor```.
Weighting is configured by a ```WeightPolicy```, which will contain a set of ```WeightNorm```s and
a ```CombinePolicy```. A query term will be run through all the ```WeightNorm``` objects to determine
its overall weighting, and a parent query will then calculate its weight with the ```CombinePolicy```,
passing in all child weights.
The following ```WeightNorm``` implementations are provided:
* FieldWeightNorm - weight all terms in a given field
* FieldSpecificTermWeightNorm - weight specific terms in specific fields
* TermTypeNorm - weight terms according to their type (EXACT terms, ANY terms, CUSTOM terms)
* TermWeightNorm - weight a specific set of terms with a given value
* TokenLengthNorm - weight a term according to its length
* TermFrequencyWeightNorm - weight a term by its term frequency
A single ```CombinePolicy``` is provided:
* MinWeightCombiner - a parent node's weight is set to the minimum weight of its children
You can create your own rules, or combine existing ones:
```java
WeightPolicy weightPolicy = WeightPolicy.Default(new FieldWeightNorm("category", 0.01f));
CombinePolicy combinePolicy = new MinWeightCombiner();
TreeWeightor weightor = new TreeWeightor(weightPolicy, combinePolicy);
Presearcher presearcher = new TermFilteredPresearcher(weightor);
```
You can debug the output of any weightor by using a ```ReportingWeightor```. ```QueryTreeViewer```
is a convenience class that may help here.
Creating an entirely new type of Presearcher
--------------------------------------------
You can implement your own query filtering code by subclassing ```Presearcher```. You will need
to implement ```buildQuery(InputDocument, QueryTermFilter)``` which converts incoming documents into queries to
be run against the Monitor's query index, and ```indexQuery(Query, Map<String,String>)``` which converts registered
queries into a form that can be indexed.
Note that ```indexQuery(Query, Map<String,String>)``` may not create fields named '_id' or '_query', as these are reserved
by the Monitor's internal index.
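A minimal skeleton might look like the following (the method signatures are taken from the description above and should be treated as assumptions that may differ between Luwak versions; the bodies are placeholders only):
```java
public class CustomPresearcher extends Presearcher {

    @Override
    public Query buildQuery(InputDocument doc, QueryTermFilter termFilter) {
        // Convert the incoming document into a query to run against the query index.
        // A real implementation would build a disjunction over the document's terms.
        return new MatchAllDocsQuery();
    }

    @Override
    public Document indexQuery(Query query, Map<String, String> metadata) {
        // Build the Lucene document that represents this query in the query index.
        // Remember that the '_id' and '_query' field names are reserved.
        Document doc = new Document();
        doc.add(new TextField("terms", "terms extracted from the query", Field.Store.NO));
        return doc;
    }
}
```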
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <parent>
        <artifactId>luwak-parent</artifactId>
        <groupId>com.github.flaxsearch</groupId>
        <version>1.6.0-SNAPSHOT</version>
    </parent>

    <modelVersion>4.0.0</modelVersion>
    <artifactId>luwak-benchmark</artifactId>

    <dependencies>
        <dependency>
            <groupId>com.github.flaxsearch</groupId>
            <artifactId>luwak</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>18.0</version>
        </dependency>
        <dependency>
            <groupId>io.dropwizard.metrics</groupId>
            <artifactId>metrics-core</artifactId>
            <version>3.1.0</version>
        </dependency>
    </dependencies>

</project>
//package uk.co.flax.luwak.benchmark;
//
///*
// * Copyright (c) 2015 Lemur Consulting Ltd.
// *
// * Licensed under the Apache License, Version 2.0 (the "License");
// * you may not use this file except in compliance with the License.
// * You may obtain a copy of the License at
// *
// * http://www.apache.org/licenses/LICENSE-2.0
// *
// * Unless required by applicable law or agreed to in writing, software
// * distributed under the License is distributed on an "AS IS" BASIS,
// * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// * See the License for the specific language governing permissions and
// * limitations under the License.
// */
//
//import java.io.IOException;
//import java.util.Iterator;
//import java.util.List;
//
//import com.google.common.collect.Iterables;
//import uk.co.flax.luwak.*;
//
//public class Benchmark {
//
// private Benchmark() {}
//
// public static <T extends QueryMatch> BenchmarkResults<T> run(Monitor monitor, Iterable<InputDocument> documents,
// int batchsize, MatcherFactory<T> matcherFactory) throws IOException {
// BenchmarkResults<T> results = new BenchmarkResults<>();
// for (DocumentBatch batch : batchDocuments(documents, batchsize)) {
// Matches<T> matches = monitor.match(batch, matcherFactory);
// results.add(matches);
// }
// return results;
// }
//
// public static Iterable<DocumentBatch> batchDocuments(Iterable<InputDocument> documents, int batchsize) {
// Iterable<List<InputDocument>> partitions = Iterables.partition(documents, batchsize);
// final Iterator<List<InputDocument>> it = partitions.iterator();
// return new Iterable<DocumentBatch>() {
// @Override
// public Iterator<DocumentBatch> iterator() {
// return new Iterator<DocumentBatch>() {
// @Override
// public boolean hasNext() {
// return it.hasNext();
// }
//
// @Override
// public DocumentBatch next() {
// return DocumentBatch.of(it.next());
// }
//
// @Override
// public void remove() {
// throw new UnsupportedOperationException();
// }
// };
// }
// };
// }
//
// public static BenchmarkResults<PresearcherMatch> timePresearcher(Monitor monitor, int batchsize, Iterable<InputDocument> documents)
// throws IOException {
// return run(monitor, documents, batchsize, PresearcherMatcher.FACTORY);
// }
//
// public static <T extends QueryMatch> ValidatorResults<T> validate(Monitor monitor, Iterable<ValidatorDocument<T>> documents,
// MatcherFactory<T> matcherFactory) throws IOException {
// ValidatorResults<T> results = new ValidatorResults<>();
// for (ValidatorDocument<T> doc : documents) {
// Matches<T> matches = monitor.match(doc.getDocument(), matcherFactory);
// results.add(matches, doc.getDocument().getId(), doc.getExpectedMatches());
// }
// return results;
// }
//
//}
package uk.co.flax.luwak.benchmark;
/*
* Copyright (c) 2015 Lemur Consulting Ltd.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.Histogram;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import uk.co.flax.luwak.Matches;
import uk.co.flax.luwak.QueryMatch;
public class BenchmarkResults<T extends QueryMatch> {
    private final MetricRegistry metrics = new MetricRegistry();
    private final Timer timer = metrics.timer("searchTimes");
    private final Histogram queryBuildTimes = metrics.histogram("queryBuildTimes");

    public void add(Matches<T> benchmarkMatches) {
        timer.update(benchmarkMatches.getSearchTime(), TimeUnit.MILLISECONDS);
        queryBuildTimes.update(benchmarkMatches.getQueryBuildTime());
    }

    public Timer getTimer() {
        return timer;
    }

    @Override
    public String toString() {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(os, true, StandardCharsets.UTF_8.name());
            ConsoleReporter.forRegistry(metrics).outputTo(out).build().report();
            return os.toString(StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }
}
package uk.co.flax.luwak.benchmark;
/*
* Copyright (c) 2015 Lemur Consulting Ltd.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import uk.co.flax.luwak.QueryMatch;
public class PresearcherMatch extends QueryMatch {
    public PresearcherMatch(String queryId, String docId) {
        super(queryId, docId);
    }

}