nordlys.core.retrieval.elastic module

Elastic

Utility class for working with Elasticsearch. This class is to be instantiated for each index.

Indexing usage

To create an index, first define the field mappings, then build the index. Sample code for creating an index is provided in nordlys.core.retrieval.toy_indexer.
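The workflow can be sketched as follows, using only the method signatures documented on this page. The index name, field names, and documents are made up for illustration, and the shape of mappings (a dictionary from field name to field mapping) is an assumption that this page does not confirm. Calling build_toy_index requires nordlys and a reachable Elasticsearch instance.

```python
# Hypothetical toy documents in the {doc_id: doc} shape that add_docs_bulk expects.
TOY_DOCS = {
    "doc1": {"title": "Hello world", "content": "A first toy document."},
    "doc2": {"title": "Second entry", "content": "Another toy document."},
}


def build_toy_index(index_name="toy_index", docs=None):
    """Defines field mappings, creates the index, and adds the documents in bulk."""
    # Imported here because it needs nordlys and a running Elasticsearch server.
    from nordlys.core.retrieval.elastic import Elastic

    docs = TOY_DOCS if docs is None else docs

    # Assumption: mappings is a dictionary from field name to field mapping.
    mappings = {
        "title": Elastic.analyzed_field(),
        "content": Elastic.analyzed_field(Elastic.ANALYZER_STOP_STEM),
    }

    elastic = Elastic(index_name)
    # force=True overwrites the index if it already exists.
    elastic.create_index(mappings, force=True)
    elastic.add_docs_bulk(docs)
    return elastic
```

The import is kept inside the function so that the module can be loaded, and the toy data inspected, without an Elasticsearch server.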

Retrieval usage

The following statistics can be obtained from this class:

Efficiency considerations

Authors: Faegheh Hasibi, Krisztian Balog
class nordlys.core.retrieval.elastic.Elastic(index_name)[source]

Bases: object

ANALYZER_STOP = 'stop_en'
ANALYZER_STOP_STEM = 'english'
BM25 = 'BM25'
DOC_TYPE = 'doc'
FIELD_CATCHALL = 'catchall'
FIELD_ELASTIC_CATCHALL = '_all'
SIMILARITY = 'sim'
add_doc(doc_id, contents)[source]

Adds a document with the specified contents to the index.

Parameters:
  • doc_id – document ID
  • contents – content of document
add_docs_bulk(docs)[source]

Adds a set of documents to the index in bulk.

Parameters: docs – dictionary of documents, {doc_id: doc}
analyze_query(query, analyzer='stop_en')[source]

Analyzes the query.

Parameters:
  • query – raw query
  • analyzer – name of analyzer
static analyzed_field(analyzer='stop_en')[source]

Returns the mapping for analyzed fields.

For efficiency reasons, term positions are not stored. To store term positions, set "term_vector" to "with_positions_offsets" in the returned mapping.

Parameters: analyzer – name of the analyzer; valid options: [ANALYZER_STOP, ANALYZER_STOP_STEM]
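As an illustration of the note on term positions, the sketch below builds a field mapping that stores positions and offsets. The helper name analyzed_field_with_positions is hypothetical, and the "type" value is an assumption that depends on the Elasticsearch version; only the "term_vector" setting is taken from the text above.

```python
def analyzed_field_with_positions(analyzer="stop_en"):
    """Hypothetical analyzed-field mapping that stores term positions and offsets."""
    return {
        "type": "text",  # assumption: the field type name varies across Elasticsearch versions
        "term_vector": "with_positions_offsets",
        "analyzer": analyzer,
    }


mapping = analyzed_field_with_positions("english")
```

Storing positions and offsets makes the index larger, which is presumably why the default mapping omits them.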
avg_len(field)[source]

Returns average length of a field in the collection.

coll_length(field)[source]

Returns length of field in the collection.

coll_term_freq(term, field, tv=None)[source]

Returns collection term frequency for the given field.

create_index(mappings, model='BM25', model_params=None, force=False)[source]

Creates index (if it doesn’t exist).

Parameters:
  • mappings – field mappings
  • model – name of the Elasticsearch similarity
  • model_params – parameters of the Elasticsearch similarity
  • force – forces index creation (overwrites if already exists)
delete_index()[source]

Deletes an index.

doc_count(field)[source]

Returns number of documents with at least one term for the given field.

doc_freq(term, field, tv=None)[source]

Returns document frequency for the given term and field.

doc_length(doc_id, field)[source]

Returns length of a field in a document.
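To make the meaning of the length and frequency statistics concrete, the sketch below computes them over a small in-memory collection instead of an Elasticsearch index. The collection and its terms are hypothetical. Two assumptions are made here: lengths are counted in terms, and the average length is taken over documents that have at least one term in the field. The class itself may define them differently.

```python
from collections import Counter

# Hypothetical, already analyzed toy collection: {doc_id: {field: [terms]}}.
COLLECTION = {
    "d1": {"title": ["hello", "world"], "content": ["hello", "hello", "again"]},
    "d2": {"title": ["world"], "content": []},
}


def doc_length(doc_id, field):
    """Number of terms in the field of one document."""
    return len(COLLECTION[doc_id].get(field, []))


def coll_length(field):
    """Total number of terms in the field across the collection."""
    return sum(doc_length(doc_id, field) for doc_id in COLLECTION)


def doc_count(field):
    """Number of documents with at least one term in the field."""
    return sum(1 for doc_id in COLLECTION if doc_length(doc_id, field) > 0)


def avg_len(field):
    """Average field length over documents that have the field (assumption)."""
    count = doc_count(field)
    return coll_length(field) / count if count else 0.0


def term_freqs(doc_id, field):
    """Frequency of every term in the field of one document."""
    return dict(Counter(COLLECTION[doc_id].get(field, [])))


def doc_freq(term, field):
    """Number of documents whose field contains the term."""
    return sum(1 for doc in COLLECTION.values() if term in doc.get(field, []))


def coll_term_freq(term, field):
    """Total number of occurrences of the term in the field across the collection."""
    return sum(term_freqs(doc_id, field).get(term, 0) for doc_id in COLLECTION)
```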

get_doc(doc_id, fields=None, source=True)[source]

Gets a document from the index based on its ID.

Parameters:
  • doc_id – document ID
  • fields – list of fields to return (default: all)
  • source – return document source as well (default: yes)
get_field_stats(field)[source]

Returns stats of the given field.

get_fields()[source]

Returns name of fields in the index.

get_mapping()[source]

Returns mapping definition for the index.

get_settings()[source]

Returns index settings.

static notanalyzed_field()[source]

Returns the mapping for not-analyzed fields.

static notanalyzed_searchable_field()[source]

Returns the mapping for not-analyzed fields that remain searchable.

num_docs()[source]

Returns the number of documents in the index.

num_fields()[source]

Returns number of fields in the index.

search(query, field, num=100, fields_return='', start=0)[source]

Searches in a given field using the similarity method configured in the index for that field.

Parameters:
  • query – query string
  • field – field to search in
  • num – number of hits to return (default: 100)
  • fields_return – additional document fields to be returned
  • start – starting offset (default: 0)
Returns:

dictionary of document IDs with scores
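Because the result is a dictionary, it carries no ranking order of its own. The sketch below turns such a dictionary into a ranked list. The exact shape of the values is not specified on this page, so the helper accepts either a bare score or a dictionary with a "score" key; both shapes, the helper name rank_results, and the sample data are assumptions.

```python
def rank_results(results):
    """Returns (doc_id, score) pairs sorted by descending score."""

    def score_of(value):
        # Assumption: a value is either a bare score or a dict with a "score" key.
        return value["score"] if isinstance(value, dict) else value

    pairs = ((doc_id, score_of(value)) for doc_id, value in results.items())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical search output mixing the two assumed value shapes.
results = {"doc1": 1.2, "doc2": 3.4, "doc3": {"score": 2.0}}
ranking = rank_results(results)
```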

search_complex(body, num=10, fields_return='', start=0)[source]

Supports complex structured queries, which are sent as the request body to Elasticsearch. For detailed information on formulating structured queries, see the official Elasticsearch documentation. The example below searches two fields, each of which must contain a specific term.

Example:
# Both clauses must match: "title" must contain term_1 and
# "content" must contain term_2 as a phrase.
term_1 = "hello"
term_2 = "world"
body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": term_1}},
                {"match_phrase": {"content": term_2}},
            ]
        }
    }
}
Parameters:
  • body – query body
  • num – number of hits to return (default: 10)
  • fields_return – additional document fields to be returned
  • start – starting offset (default: 0)
Returns:

dictionary of document IDs with scores
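Query bodies like the one in the example can be assembled programmatically. The sketch below builds a bool query in which every clause must match. The helper name build_must_body and its parameters are hypothetical; only the resulting dictionary structure follows the example above.

```python
def build_must_body(field_terms, phrase_fields=()):
    """Builds a bool query in which every (field, text) clause must match.

    Fields listed in phrase_fields use "match_phrase"; all others use "match".
    """
    must = []
    for field, text in field_terms:
        clause = "match_phrase" if field in phrase_fields else "match"
        must.append({clause: {field: text}})
    return {"query": {"bool": {"must": must}}}


body = build_must_body(
    [("title", "hello"), ("content", "world")],
    phrase_fields={"content"},
)
```

The resulting body can then be passed as the first argument of search_complex.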

term_freq(doc_id, field, term)[source]

Returns frequency of a term in a given document and field.

term_freqs(doc_id, field, tv=None)[source]

Returns term frequencies of all terms for a given document and field.

update_similarity(model='BM25', params=None)[source]

Updates the similarity function “sim”, which is fixed for all index fields.

Parameters:
  • model – name of the Elasticsearch similarity model
  • params – dictionary of parameters for the Elasticsearch similarity model
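For orientation, the sketch below shows what a similarity definition named "sim" looks like as an Elasticsearch index settings fragment. How this class assembles and submits the settings is not shown on this page, so the helper similarity_settings is hypothetical. The parameter names k1 and b are Elasticsearch's BM25 options, and the values used are only examples.

```python
def similarity_settings(model="BM25", params=None, name="sim"):
    """Builds an index settings fragment that defines a named similarity."""
    config = {"type": model}
    # Model-specific parameters, e.g. k1 and b for BM25.
    config.update(params or {})
    return {"similarity": {name: config}}


settings = similarity_settings("BM25", {"k1": 1.2, "b": 0.75})
```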