The retrieval package provides basic indexing and scoring functionality based on Elasticsearch (v2.3). It can be used both for documents and for entities (as the latter are represented as fielded documents).
Indexing can be done be done by directly reading the content of documents.
toy_indexer module provides a toy example.
When the content of documents is stored in MongoDB (e.g., for DBpedia entities), use the
indexer_mongo module for indexing.
For further details on how this module can be used, see
For indexing Dbpedia entities, we read the content of entiteis form MongoDB aFor DBpedia entities, we store them on MongoDB and .. todo:: Explain indexing (representing entities as fielded documents, mongo to elasticsearch)
- To speed up indexing, use
add_docs_bulk(). The optimal number of documents to send in a single bulk depends on the size of documents; you need to figure it out experimentally.
- We strongly recommend using the default Elasticsearch similarity (currently BM25) for indexing. (Other similarity functions may be also used; in that case the similarity function can updated after indexing.)
- Our default setting is not to store term positions in the index (for efficiency considerations).
Retrieval is done in two stages:
- First pass: The top
Ndocuments are retrieved using Elastic’s default search method
- Second pass: The (expensive) scoring of the top
Ndocuments is performed (implemented in the Nordlys)
Nordlys currently supports the following models for second pass retrieval:
- Language modelling (LM) 
- Mixture of Language Modesl (MLM) 
- Probabilistic Model for Semistructured Data (PRMS) 
scorer module to get inspiration for implementing a new retrieval model.
- Always use a
ElasticCacheobject (instead of
Elastic) for getting stats from the index. This class stores index stats in the memory, which highly benefits efficiency.
- We recommend to create a new
ElasticCacheobject for each query. This way, you will make effiecnt of your machine’s memory.
 Jay M Ponte and W Bruce Croft. 1998. A Language modeling approach to information retrieval. In Proc. of SIGIR ‘98.
 Paul Ogilvie and Jamie Callan. 2003. Combining document representations for known-item search. Proc. of SIGIR ‘03.
 Jinyoung Kim, Xiaobing Xue, and W Bruce Croft. 2009. A probabilistic retrieval model for semistructured data. In Proc. of ECIR ‘09.
- nordlys.core.retrieval.elastic module
- nordlys.core.retrieval.elastic_cache module
- nordlys.core.retrieval.indexer_mongo module
- nordlys.core.retrieval.retrieval module
- nordlys.core.retrieval.retrieval_results module
- nordlys.core.retrieval.scorer module
- nordlys.core.retrieval.toy_indexer module