nordlys.core.retrieval package¶
Retrieval¶
The retrieval package provides basic indexing and scoring functionality based on Elasticsearch (v2.3). It can be used both for documents and for entities (as the latter are represented as fielded documents).
Indexing¶
Indexing can be done be done by directly reading the content of documents.
The toy_indexer
module provides a toy example.
When the content of documents is stored in MongoDB (e.g., for DBpedia entities), use the indexer_mongo
module for indexing.
For further details on how this module can be used, see indexer_dbpedia
.
For indexing Dbpedia entities, we read the content of entiteis form MongoDB aFor DBpedia entities, we store them on MongoDB and .. todo:: Explain indexing (representing entities as fielded documents, mongo to elasticsearch)
Notes¶
- To speed up indexing, use
add_docs_bulk()
. The optimal number of documents to send in a single bulk depends on the size of documents; you need to figure it out experimentally. - We strongly recommend using the default Elasticsearch similarity (currently BM25) for indexing. (Other similarity functions may be also used; in that case the similarity function can updated after indexing.)
- Our default setting is not to store term positions in the index (for efficiency considerations).
Retrieval¶
Retrieval is done in two stages:
- First pass: The top
N
documents are retrieved using Elastic’s default search method - Second pass: The (expensive) scoring of the top
N
documents is performed (implemented in the Nordlys)
Nordlys currently supports the following models for second pass retrieval:
- Language modelling (LM) [1]
- Mixture of Language Modesl (MLM) [2]
- Probabilistic Model for Semistructured Data (PRMS) [3]
Check out scorer
module to get inspiration for implementing a new retrieval model.
Command line usage¶
Notes¶
- Always use a
ElasticCache
object (instead ofElastic
) for getting stats from the index. This class stores index stats in the memory, which highly benefits efficiency. - We recommend to create a new
ElasticCache
object for each query. This way, you will make effiecnt of your machine’s memory.
[1] Jay M Ponte and W Bruce Croft. 1998. A Language modeling approach to information retrieval. In Proc. of SIGIR ‘98.
[2] Paul Ogilvie and Jamie Callan. 2003. Combining document representations for known-item search. Proc. of SIGIR ‘03.
[3] Jinyoung Kim, Xiaobing Xue, and W Bruce Croft. 2009. A probabilistic retrieval model for semistructured data. In Proc. of ECIR ‘09.