nordlys.core.retrieval.elastic module¶

Elastic¶

Utility class for working with Elasticsearch. This class is to be instantiated for each index.

Indexing usage¶

To create an index, first you need to define field mappings and then build the index. The sample code for creating an index is provided at nordlys.core.retrieval.toy_indexer.
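A minimal sketch of that two-step flow, using only methods documented on this page. The field names, the toy documents, and the assumption that mappings is a dictionary from field name to field mapping are illustrative. The function accepts any object exposing the Elastic interface, so it can be tried without a running Elasticsearch instance.

```python
def build_toy_index(elastic, docs):
    """Define field mappings, (re)create the index, then bulk-add documents.

    `elastic` is expected to behave like the Elastic class.
    Assumption: mappings are a dict {field_name: field_mapping}.
    """
    mappings = {
        "title": elastic.notanalyzed_field(),  # not analyzed
        "content": elastic.analyzed_field(),   # analyzed with the default analyzer
    }
    # force=True overwrites an existing index with the same name.
    elastic.create_index(mappings, force=True)
    elastic.add_docs_bulk(docs)
    return mappings


# Hypothetical toy collection: {doc_id: {field: text}}
TOY_DOCS = {
    "doc1": {"title": "First", "content": "hello world"},
    "doc2": {"title": "Second", "content": "hello again"},
}
```

With nordlys installed and Elasticsearch running, this could be called as build_toy_index(Elastic("toy_index"), TOY_DOCS), where the index name is hypothetical.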

Retrieval usage¶

The following statistics can be obtained from this class:

Number of documents: Elastic.num_docs()

Number of fields: Elastic.num_fields()

Document count: Elastic.doc_count()

Collection length: Elastic.coll_length()

Average length: Elastic.avg_len()

Document length: Elastic.doc_length()

Document frequency: Elastic.doc_freq()

Collection frequency: Elastic.coll_term_freq()

Term frequencies: Elastic.term_freqs()
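To make the terminology concrete, the sketch below computes the same kinds of statistics over a tiny in-memory collection for a single field. It only illustrates the usual definitions and is not how Elastic obtains them. The whitespace tokenizer, the toy documents, and the choice to average length over documents that have at least one term in the field are assumptions.

```python
from collections import Counter

# Hypothetical toy collection for one field: {doc_id: field text}
DOCS = {
    "d1": "to be or not to be",
    "d2": "to do",
    "d3": "",  # document with no terms in this field
}


def term_freqs(doc_id):
    """Term frequencies of all terms in one document."""
    return Counter(DOCS[doc_id].split())


def doc_length(doc_id):
    """Number of term occurrences in one document."""
    return sum(term_freqs(doc_id).values())


def doc_count():
    """Number of documents with at least one term in the field."""
    return sum(1 for d in DOCS if doc_length(d) > 0)


def coll_length():
    """Total number of term occurrences in the field across the collection."""
    return sum(doc_length(d) for d in DOCS)


def avg_len():
    """Average field length over documents that have the field (assumed definition)."""
    return coll_length() / doc_count()


def doc_freq(term):
    """Number of documents containing the term."""
    return sum(1 for d in DOCS if term in term_freqs(d))


def coll_term_freq(term):
    """Total number of occurrences of the term in the collection."""
    return sum(term_freqs(d)[term] for d in DOCS)
```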

Efficiency considerations¶

For efficiency reasons, term positions are not stored during indexing. To store them, adjust the corresponding mapping functions, Elastic.analyzed_field() and Elastic.notanalyzed_searchable_field().

Use ElasticCache for getting index statistics. It caches the statistics in memory, which improves efficiency.

Note that ElasticCache never empties its cache, so memory use grows as more statistics are cached.
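The caching idea can be sketched as follows. This is not the ElasticCache implementation: StatsCache, its method names, and the backend interface are hypothetical. It only shows why repeated lookups become cheap and why memory keeps growing when nothing is evicted.

```python
class StatsCache:
    """Memoizes statistic lookups in a plain dict that is never emptied."""

    def __init__(self, backend):
        self._backend = backend  # any object with doc_freq(term, field)
        self._doc_freq = {}      # grows with every distinct (term, field) pair

    def doc_freq(self, term, field):
        key = (term, field)
        if key not in self._doc_freq:
            # Only the first request for a key reaches the backend.
            self._doc_freq[key] = self._backend.doc_freq(term, field)
        return self._doc_freq[key]

    def cache_size(self):
        """Number of cached entries."""
        return len(self._doc_freq)
```

In a long-running process, one possible way to bound memory with a cache of this kind is to create a fresh cache object for each batch of queries.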

Authors:	Faegheh Hasibi, Krisztian Balog

class nordlys.core.retrieval.elastic.Elastic(index_name)[source]¶

Bases: object

ANALYZER_STOP = 'stop_en'¶

ANALYZER_STOP_STEM = 'english'¶

BM25 = 'BM25'¶

DOC_TYPE = 'doc'¶

FIELD_CATCHALL = 'catchall'¶

FIELD_ELASTIC_CATCHALL = '_all'¶

SIMILARITY = 'sim'¶

add_doc(doc_id, contents)[source]¶

Adds a document with the specified contents to the index.

Parameters:	doc_id – document ID
	contents – content of document

add_docs_bulk(docs)[source]¶

Adds a set of documents to the index in bulk.

Parameters:	docs – dictionary {doc_id: doc}
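To keep each bulk request small, a large {doc_id: doc} dictionary can be split into several smaller ones, each passed to add_docs_bulk() in turn. The helper batches is hypothetical and the batch size is arbitrary.

```python
from itertools import islice


def batches(docs, size):
    """Yield successive {doc_id: doc} dictionaries with at most `size` entries."""
    items = iter(docs.items())
    while True:
        chunk = dict(islice(items, size))
        if not chunk:
            return
        yield chunk
```

A caller would then loop over batches(docs, 1000) and call add_docs_bulk(chunk) for each chunk.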

analyze_query(query, analyzer='stop_en')[source]¶

Analyzes the query.

Parameters:	query – raw query
	analyzer – name of analyzer

static analyzed_field(analyzer='stop_en')[source]¶

Returns the mapping for analyzed fields.

For efficiency reasons, term positions are not stored. To store them, set "term_vector" to "with_positions_offsets" in the mapping.

Parameters:	analyzer – name of the analyzer; valid options: [ANALYZER_STOP, ANALYZER_STOP_STEM]
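As a sketch of that change, the helper below copies a field mapping and switches on positions and offsets. with_term_positions is a hypothetical helper, and the example mapping passed to it is an assumption about what a field mapping looks like. "with_positions_offsets" is a standard Elasticsearch term_vector value.

```python
def with_term_positions(mapping):
    """Return a copy of a field mapping that also stores positions and offsets."""
    updated = dict(mapping)  # leave the original mapping untouched
    updated["term_vector"] = "with_positions_offsets"
    return updated


# Assumed shape of an analyzed-field mapping, for illustration only.
example = {"term_vector": "yes", "analyzer": "stop_en"}
positional = with_term_positions(example)
```

With nordlys available, the input would be the result of Elastic.analyzed_field().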

avg_len(field)[source]¶: Returns average length of a field in the collection.

coll_length(field)[source]¶: Returns length of field in the collection.

coll_term_freq(term, field, tv=None)[source]¶: Returns collection term frequency for the given term and field.

create_index(mappings, model='BM25', model_params=None, force=False)[source]¶

Creates index (if it doesn’t exist).

Parameters:	mappings – field mappings
	model – name of the Elasticsearch similarity
	model_params – parameters of the Elasticsearch similarity
	force – forces index creation (overwrites if already exists)

delete_index()[source]¶: Deletes an index.

doc_count(field)[source]¶: Returns number of documents with at least one term for the given field.

doc_freq(term, field, tv=None)[source]¶: Returns document frequency for the given term and field.

doc_length(doc_id, field)[source]¶: Returns length of a field in a document.

get_doc(doc_id, fields=None, source=True)[source]¶

Gets a document from the index based on its ID.

Parameters:	doc_id – document ID
	fields – list of fields to return (default: all)
	source – return document source as well (default: yes)

get_field_stats(field)[source]¶: Returns stats of the given field.

get_fields()[source]¶: Returns the names of the fields in the index.

get_mapping()[source]¶: Returns mapping definition for the index.

get_settings()[source]¶: Returns index settings.

static notanalyzed_field()[source]¶: Returns the mapping for not-analyzed fields.

static notanalyzed_searchable_field()[source]¶: Returns the mapping for not-analyzed fields that are searchable.

num_docs()[source]¶: Returns the number of documents in the index.

num_fields()[source]¶: Returns number of fields in the index.

search(query, field, num=100, fields_return='', start=0)[source]¶

Searches in a given field using the similarity method configured in the index for that field.

Parameters:	query – query string
	field – field to search in
	num – number of hits to return (default: 100)
	fields_return – additional document fields to be returned
	start – starting offset (default: 0)
Returns:	dictionary of document IDs with scores
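A sketch of consuming the result. The description above only says that a dictionary keyed by document ID is returned. That each value is either a plain score or a dictionary with a "score" key is an assumption here, and top_hits is a hypothetical helper.

```python
def top_hits(elastic, query, field, k=10):
    """Return the k best (doc_id, score) pairs, highest score first."""
    results = elastic.search(query, field, num=k)

    def score_of(value):
        # Assumption: value is a number, or a dict carrying a "score" entry.
        return value["score"] if isinstance(value, dict) else value

    ranked = sorted(results.items(), key=lambda item: score_of(item[1]), reverse=True)
    return [(doc_id, score_of(value)) for doc_id, value in ranked[:k]]
```

The helper takes any object with a compatible search() method, so it can be exercised with a stub before pointing it at a real index.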

search_complex(body, num=10, fields_return='', start=0)[source]¶

Supports complex structured queries, which are sent as the body of the Elasticsearch search request. For detailed information on formulating structured queries, see the official instructions. Below is an example that searches in two particular fields, each of which must contain a specific term.

Example:

# The title field must match term_1 and the content field must contain term_2 as a phrase.
term_1 = "hello"
term_2 = "world"
body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": term_1}},
                {"match_phrase": {"content": term_2}}
            ]
        }
    }
}

Parameters:	body – query body
	num – number of hits to return (default: 10)
	fields_return – additional document fields to be returned
	start – starting offset (default: 0)
Returns:	dictionary of document IDs with scores
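The same pattern generalizes to any number of fields. The sketch below builds a bool query in which every listed field must match its term. must_match_query is a hypothetical helper, and the field names are taken from the example above.

```python
def must_match_query(field_terms):
    """Build a bool query in which every field must match its term."""
    clauses = [{"match": {field: term}} for field, term in field_terms.items()]
    return {"query": {"bool": {"must": clauses}}}


body = must_match_query({"title": "hello", "content": "world"})
```

The resulting body could then be passed as search_complex(body, num=10).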

term_freq(doc_id, field, term)[source]¶: Returns frequency of a term in a given document and field.

term_freqs(doc_id, field, tv=None)[source]¶: Returns term frequencies of all terms for a given document and field.

update_similarity(model='BM25', params=None)[source]¶

Updates the similarity function “sim”, which is fixed for all index fields.

The model and params should match the Elasticsearch similarity settings: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index-modules-similarity.html

Parameters:	model – name of the Elasticsearch similarity model
	params – dictionary of similarity parameters, as defined by Elasticsearch
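For orientation, the sketch below builds the kind of settings fragment that Elasticsearch uses for a custom similarity named "sim". How update_similarity assembles its request is not shown on this page, so similarity_settings is a hypothetical helper. k1 and b are the BM25 parameters described in the Elasticsearch similarity documentation linked above.

```python
def similarity_settings(model="BM25", params=None, name="sim"):
    """Build an index settings fragment for a named similarity."""
    config = {"type": model}
    config.update(params or {})  # e.g. {"k1": 1.2, "b": 0.75} for BM25
    return {"similarity": {name: config}}


settings = similarity_settings("BM25", {"k1": 1.2, "b": 0.75})
```

With these values, the corresponding call on this class would be update_similarity(model="BM25", params={"k1": 1.2, "b": 0.75}).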