nordlys.core.retrieval.elastic module¶
Elastic¶
Utility class for working with Elasticsearch. This class is to be instantiated for each index.
Indexing usage¶
To create an index, first define the field mappings, then build the index.
Sample code for creating an index is provided in nordlys.core.retrieval.toy_indexer.
Retrieval usage¶
The following statistics can be obtained from this class:
- Number of documents:
Elastic.num_docs()
- Number of fields:
Elastic.num_fields()
- Document count:
Elastic.doc_count()
- Collection length:
Elastic.coll_length()
- Average length:
Elastic.avg_len()
- Document length:
Elastic.doc_length()
- Document frequency:
Elastic.doc_freq()
- Collection frequency:
Elastic.coll_term_freq()
- Term frequencies:
Elastic.term_freqs()
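To make the meaning of these statistics concrete, the following is a minimal in-memory sketch computed over a toy corpus. It is not how the Elastic class obtains them (that goes through Elasticsearch); the corpus, the whitespace tokenization, and the argument lists of most helpers are assumptions made for illustration only.

```python
from collections import Counter

# Toy corpus {doc_id: {field: text}}; all IDs, fields and texts are made up.
docs = {
    "d1": {"title": "hello world"},
    "d2": {"title": "hello hello elastic"},
    "d3": {"title": "retrieval"},
}


def term_freqs(doc_id, field):
    """Term -> frequency for one document field (naive whitespace tokens)."""
    return Counter(docs[doc_id][field].split())


def doc_length(doc_id, field):
    """Number of tokens in the given field of one document."""
    return sum(term_freqs(doc_id, field).values())


def coll_length(field):
    """Total number of tokens in the field across the whole collection."""
    return sum(doc_length(doc_id, field) for doc_id in docs)


def avg_len(field):
    """Average field length per document."""
    return coll_length(field) / len(docs)


def doc_freq(term, field):
    """Number of documents whose field contains the term."""
    return sum(1 for doc_id in docs if term in term_freqs(doc_id, field))


def coll_term_freq(term, field):
    """Total number of occurrences of the term in the field."""
    return sum(term_freqs(doc_id, field)[term] for doc_id in docs)
```

For example, in this toy corpus "hello" occurs in two documents but three times overall, which is the difference between document frequency and collection frequency.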
Efficiency considerations¶
- For efficiency reasons, term positions are not stored during indexing. To store them, see the corresponding mapping functions Elastic.analyzed_field() and Elastic.notanalyzed_searchable_field().
- Use ElasticCache for getting index statistics. This module caches the statistics in memory and improves efficiency.
- Mind that ElasticCache does not empty the cache!
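The caching idea can be sketched as follows. This is not the ElasticCache implementation; `StatsCache`, `CountingBackend`, and the `doc_freq(term, field)` argument list are hypothetical names used only to show why repeated lookups become cheap and why a cache that is never emptied keeps growing.

```python
class StatsCache:
    """Hypothetical in-memory cache in front of a statistics backend."""

    def __init__(self, backend):
        self._backend = backend  # any object offering doc_freq(term, field)
        self._doc_freq = {}      # (term, field) -> value; entries are never evicted

    def doc_freq(self, term, field):
        key = (term, field)
        if key not in self._doc_freq:
            # Only the first lookup for a key reaches the backend.
            self._doc_freq[key] = self._backend.doc_freq(term, field)
        return self._doc_freq[key]

    def cache_size(self):
        return len(self._doc_freq)


class CountingBackend:
    """Stand-in for an index; counts how often it is queried."""

    def __init__(self):
        self.calls = 0

    def doc_freq(self, term, field):
        self.calls += 1
        return len(term)  # dummy value, not a real statistic


backend = CountingBackend()
cache = StatsCache(backend)
first = cache.doc_freq("hello", "title")
second = cache.doc_freq("hello", "title")  # served from memory
```

Because nothing is ever removed, memory use grows with the number of distinct (term, field) pairs requested, which is worth keeping in mind for long-running jobs.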
Authors: Faegheh Hasibi, Krisztian Balog
-
class
nordlys.core.retrieval.elastic.
Elastic
(index_name)[source]¶ Bases:
object
-
ANALYZER_STOP
= 'stop_en'¶
-
ANALYZER_STOP_STEM
= 'english'¶
-
BM25
= 'BM25'¶
-
DOC_TYPE
= 'doc'¶
-
FIELD_CATCHALL
= 'catchall'¶
-
FIELD_ELASTIC_CATCHALL
= '_all'¶
-
SIMILARITY
= 'sim'¶
-
add_doc
(doc_id, contents)[source]¶ Adds a document with the specified contents to the index.
Parameters: - doc_id – document ID
- contents – content of document
-
add_docs_bulk
(docs)[source]¶ Adds a set of documents to the index in bulk.
Parameters: docs – dictionary {doc_id: doc}
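As an illustration of the expected input shape, the sketch below builds a {doc_id: doc} dictionary and checks it before it would be handed to add_docs_bulk. That each document is itself a dictionary of fields is an assumption, as are the field names "title" and "content"; `check_bulk_input` is a hypothetical helper, not part of the module.

```python
# Hypothetical documents; field names and contents are made up.
docs = {
    "doc-1": {"title": "Hello", "content": "hello world"},
    "doc-2": {"title": "Elastic", "content": "indexing with elastic"},
}


def check_bulk_input(docs):
    """Validate the {doc_id: doc} shape and return the number of documents."""
    if not isinstance(docs, dict):
        raise TypeError("docs must be a dictionary {doc_id: doc}")
    for doc_id, doc in docs.items():
        if not isinstance(doc, dict):
            raise TypeError(f"document {doc_id!r} must be a dictionary of fields")
    return len(docs)


n_docs = check_bulk_input(docs)
# With a running Elasticsearch instance, the dictionary could then be passed on,
# e.g. Elastic(index_name).add_docs_bulk(docs).
```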
-
analyze_query
(query, analyzer='stop_en')[source]¶ Analyzes the query.
Parameters: - query – raw query
- analyzer – name of analyzer
-
static
analyzed_field
(analyzer='stop_en')[source]¶ Returns the mapping for analyzed fields.
For efficiency reasons, term positions are not stored. To store term positions, set
"term_vector": "with_positions_offsets" in the mapping.
Parameters: analyzer – name of the analyzer; valid options: [ANALYZER_STOP, ANALYZER_STOP_STEM]
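A field mapping with a configurable term_vector setting might look like the sketch below. This is not the body of analyzed_field(); the helper name, the "text" field type, and the default of "yes" are assumptions. The term_vector values listed are ones accepted by Elasticsearch.

```python
# A subset of term_vector values accepted by Elasticsearch.
TERM_VECTOR_OPTIONS = {
    "no",
    "yes",
    "with_positions",
    "with_offsets",
    "with_positions_offsets",
}


def analyzed_field_mapping(analyzer="stop_en", term_vector="yes"):
    """Return a mapping dictionary for an analyzed field (hypothetical helper)."""
    if term_vector not in TERM_VECTOR_OPTIONS:
        raise ValueError(f"unsupported term_vector: {term_vector!r}")
    return {
        "type": "text",  # assumption; older Elasticsearch versions use "string"
        "analyzer": analyzer,
        "term_vector": term_vector,
    }


# Mapping that also keeps positions and offsets, at the cost of a larger index.
with_positions = analyzed_field_mapping("english", "with_positions_offsets")
```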
-
coll_term_freq
(term, field, tv=None)[source]¶ Returns collection term frequency for the given field.
-
create_index
(mappings, model='BM25', model_params=None, force=False)[source]¶ Creates index (if it doesn’t exist).
Parameters: - mappings – field mappings
- model – name of elastic search similarity
- model_params – parameters of the elastic search similarity
- force – forces index creation (overwrites if already exists)
-
get_doc
(doc_id, fields=None, source=True)[source]¶ Gets a document from the index based on its ID.
Parameters: - doc_id – document ID
- fields – list of fields to return (default: all)
- source – return document source as well (default: yes)
-
search
(query, field, num=100, fields_return='', start=0)[source]¶ Searches in a given field using the similarity method configured in the index for that field.
Parameters: - query – query string
- field – field to search in
- num – number of hits to return (default: 100)
- fields_return – additional document fields to be returned
- start – starting offset (default: 0)
Returns: dictionary of document IDs with scores
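The num and start parameters allow results to be fetched page by page. The sketch below shows one way to do that against any callable with the same signature as search. `iter_pages` and `fake_search` are hypothetical, and the result format (a plain {doc_id: score} dictionary) is an assumption based on the description above.

```python
def iter_pages(search_fn, query, field, page_size=100):
    """Yield result pages until the search returns an empty page."""
    start = 0
    while True:
        page = search_fn(query, field, num=page_size, start=start)
        if not page:
            break
        yield page
        start += page_size


# Stand-in for Elastic.search so the sketch runs without Elasticsearch.
FAKE_HITS = [("d%d" % i, 10.0 - i) for i in range(5)]


def fake_search(query, field, num=100, fields_return="", start=0):
    """Return a slice of a fixed hit list as {doc_id: score}."""
    return dict(FAKE_HITS[start:start + num])


pages = list(iter_pages(fake_search, "hello", "title", page_size=2))
```

With an Elastic instance, the bound method could be passed in place of fake_search, for example iter_pages(Elastic(index_name).search, query, field).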
-
search_complex
(body, num=10, fields_return='', start=0)[source]¶ Supports complex structured queries, which are sent as a
body
field in Elastic search. For detailed information on formulating structured queries, see the official instructions. Below is an example to search in two particular fields that each must contain a specific term.
Example:
    # [explanation of the query]
    term_1 = "hello"
    term_2 = "world"
    body = {
        "query": {
            "bool": {
                "must": [
                    {"match": {"title": term_1}},
                    {"match_phrase": {"content": term_2}}
                ]
            }
        }
    }
Parameters: - body – query body
- num – number of hits to return (default: 10)
- fields_return – additional document fields to be returned
- start – starting offset (default: 0)
Returns: dictionary of document IDs with scores
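Query bodies of the kind shown in the example can also be assembled programmatically. The helper below is a hypothetical sketch (it is not part of the module) that builds a bool/must query from a {field: text} dictionary, using match_phrase for the fields named in phrase_fields and match for the rest.

```python
def must_match_body(field_terms, phrase_fields=()):
    """Build a bool/must query body from a {field: text} dictionary."""
    clauses = []
    for field, text in field_terms.items():
        # Phrase fields must contain the text as a contiguous phrase.
        kind = "match_phrase" if field in phrase_fields else "match"
        clauses.append({kind: {field: text}})
    return {"query": {"bool": {"must": clauses}}}


# Reproduces the structure of the documented example.
body = must_match_body(
    {"title": "hello", "content": "world"},
    phrase_fields={"content"},
)
```

The resulting dictionary can be passed as the body argument of search_complex.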
-
term_freqs
(doc_id, field, tv=None)[source]¶ Returns term frequencies of all terms for a given document and field.
-
update_similarity
(model='BM25', params=None)[source]¶ Updates the similarity function “sim”, which is fixed for all index fields.
The method and param should match elastic settings: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index-modules-similarity.html
Parameters: - model – name of the elastic model
- params – dictionary of params based on elastic
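For orientation, Elasticsearch similarity settings pair a similarity name with a type and its parameters. The sketch below builds such a dictionary for the "sim" similarity. `similarity_settings` is a hypothetical helper, and how update_similarity assembles and submits its settings internally is an assumption; k1 and b are the standard BM25 parameters in Elasticsearch.

```python
def similarity_settings(model="BM25", params=None, name="sim"):
    """Build an Elasticsearch similarity settings dictionary (hypothetical helper)."""
    similarity = {"type": model}
    similarity.update(params or {})  # model-specific parameters, if any
    return {"similarity": {name: similarity}}


# BM25 with explicit k1 and b values.
settings = similarity_settings("BM25", {"k1": 1.2, "b": 0.75})
```

The same params dictionary, here {"k1": 1.2, "b": 0.75}, is the kind of value that would be passed as params to update_similarity or as model_params to create_index.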
-