nordlys.core.retrieval.scorer module

Scorer

Various retrieval models for scoring an individual document for a given query.

Authors: Faegheh Hasibi, Krisztian Balog
class nordlys.core.retrieval.scorer.Scorer(elastic, query, params)[source]

Bases: object

Base scorer class.

SCORER_DEBUG = 0
static get_scorer(elastic, query, config)[source]

Returns Scorer object (Scorer factory).

Parameters:
  • elastic – Elastic object
  • query – raw query (to be analyzed)
  • config – dict with model parameters
class nordlys.core.retrieval.scorer.ScorerLM(elastic, query, params)[source]

Bases: nordlys.core.retrieval.scorer.Scorer

Language Model (LM) scorer.

DIRICHLET = 'dirichlet'
JM = 'jm'
static get_dirichlet_prob(tf_t_d, len_d, tf_t_C, len_C, mu)[source]

Computes Dirichlet-smoothed probability. P(t|theta_d) = [tf(t, d) + mu P(t|C)] / [|d| + mu]

Parameters:
  • tf_t_d – tf(t,d)
  • len_d – |d|
  • tf_t_C – tf(t,C)
  • len_C – |C| = sum_{d in C} |d|
  • mu – mu
Returns:

Dirichlet-smoothed probability
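
The formula above can be sketched as a standalone function. This is a minimal illustration, not the module's actual source: it assumes that P(t|C) is estimated as tf(t,C)/|C| (consistent with the parameter list), and the zero-denominator guard is an assumed convention.

```python
def dirichlet_prob(tf_t_d, len_d, tf_t_C, len_C, mu):
    """Dirichlet-smoothed P(t|theta_d) = [tf(t,d) + mu * P(t|C)] / [|d| + mu]."""
    # Assumption: the collection model P(t|C) is the relative frequency tf(t,C) / |C|.
    p_t_C = tf_t_C / len_C if len_C > 0 else 0.0
    denom = len_d + mu
    # Assumed convention: an empty document with mu = 0 gets probability 0.
    if denom == 0:
        return 0.0
    return (tf_t_d + mu * p_t_C) / denom


# A term seen twice in a 10-token document and 50 times in a 1000-token collection.
p = dirichlet_prob(tf_t_d=2, len_d=10, tf_t_C=50, len_C=1000, mu=100)
```

With mu = 0 the estimate reduces to the unsmoothed relative frequency tf(t,d)/|d|; larger values of mu pull the estimate toward the collection model.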

static get_jm_prob(tf_t_d, len_d, tf_t_C, len_C, lambd)[source]

Computes JM-smoothed probability. p(t|theta_d) = [(1-lambda) tf(t, d)/|d|] + [lambda tf(t, C)/|C|]

Parameters:
  • tf_t_d – tf(t,d)
  • len_d – |d|
  • tf_t_C – tf(t,C)
  • len_C – |C| = sum_{d in C} |d|
  • lambd – lambda
Returns:

JM-smoothed probability
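
A minimal sketch of the Jelinek-Mercer interpolation described above, written as a standalone function. It is not the module's actual source; treating an empty document or collection as contributing probability 0 is an assumption.

```python
def jm_prob(tf_t_d, len_d, tf_t_C, len_C, lambd):
    """JM-smoothed p(t|theta_d) = (1 - lambda) * tf(t,d)/|d| + lambda * tf(t,C)/|C|."""
    # Assumption: an empty document or collection contributes probability 0.
    p_t_d = tf_t_d / len_d if len_d > 0 else 0.0
    p_t_C = tf_t_C / len_C if len_C > 0 else 0.0
    return (1 - lambd) * p_t_d + lambd * p_t_C


# A term seen twice in a 10-token document and 50 times in a 1000-token collection.
p = jm_prob(tf_t_d=2, len_d=10, tf_t_C=50, len_C=1000, lambd=0.1)
```

Here lambda is the weight of the collection model: lambda = 0 gives the unsmoothed document estimate and lambda = 1 gives the collection estimate.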

get_lm_term_prob(doc_id, field, t, tf_t_d_f=None, tf_t_C_f=None)[source]

Returns term probability for a document and field.

Parameters:
  • doc_id – document ID
  • field – field name
  • t – term
Returns:

P(t|d_f)

get_lm_term_probs(doc_id, field)[source]

Returns the probabilities of all query terms for a document and field; i.e. p(t|theta_d_f)

Parameters:
  • doc_id – document ID
  • field – field name
Returns:

dictionary of terms with their probabilities

score_doc(doc_id)[source]

Scores the given document using LM. log p(q|theta_d) = sum_t log(p(t|theta_d))

Parameters: doc_id – document ID
Returns: LM score
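
The scoring step can be illustrated as a sum of log term probabilities over the query terms. The sketch below is hypothetical: `lm_score` and its arguments are illustrative names, and skipping terms with zero probability is an assumed convention rather than documented behaviour.

```python
import math


def lm_score(query_terms, term_probs):
    """Sum of log p(t|theta_d) over the query terms (hypothetical helper)."""
    score = 0.0
    for t in query_terms:
        p = term_probs.get(t, 0.0)
        # Assumption: zero-probability terms are skipped to avoid log(0).
        if p > 0:
            score += math.log(p)
    return score


# term_probs plays the role of the dictionary returned by get_lm_term_probs.
score = lm_score(["a", "b"], {"a": 0.5, "b": 0.25})
```

Summing logs instead of multiplying probabilities avoids floating-point underflow for long queries; the resulting scores are negative, and values closer to zero indicate a better match.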
class nordlys.core.retrieval.scorer.ScorerMLM(elastic, query, params)[source]

Bases: nordlys.core.retrieval.scorer.ScorerLM

Mixture of Language Models (MLM) scorer.

Implemented based on:
Ogilvie, Callan. Combining document representations for known-item search. SIGIR 2003.
get_mlm_term_prob(doc_id, t)[source]

Returns MLM probability for the given term and field-weights. p(t|theta_d) = sum(mu_f * p(t|theta_d_f))

Parameters:
  • doc_id – internal Lucene document ID
  • t – term
Returns:

P(t|theta_d)
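
The mixture above can be sketched as a weighted sum over fields. The function and argument names are hypothetical; the sketch assumes the per-field probabilities p(t|theta_d_f) have already been computed and that a field missing from `field_probs` contributes 0.

```python
def mlm_term_prob(field_probs, field_weights):
    """p(t|theta_d) = sum_f mu_f * p(t|theta_d_f) (hypothetical helper)."""
    # Assumption: a field without a probability entry contributes 0.
    return sum(w * field_probs.get(f, 0.0) for f, w in field_weights.items())


# Per-field probabilities of one term, mixed with field weights that sum to 1.
p = mlm_term_prob(
    field_probs={"title": 0.2, "body": 0.05},
    field_weights={"title": 0.3, "body": 0.7},
)
```

If the field weights sum to 1 and each per-field value is a valid probability, the mixture is also a valid probability.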

get_mlm_term_probs(doc_id)[source]

Returns probability of all query terms for a document; i.e. p(t|theta_d)

Parameters: doc_id – internal Lucene document ID
Returns: dictionary of terms with their probabilities
score_doc(doc_id)[source]

Scores the given document using the MLM model. log p(q|theta_d) = sum_t log(p(t|theta_d))

Parameters: doc_id – document ID
Returns: MLM score of document and query
class nordlys.core.retrieval.scorer.ScorerPRMS(elastic, query, params)[source]

Bases: nordlys.core.retrieval.scorer.ScorerLM

PRMS scorer.

get_mapping_prob(t, coll_termfreq_fields=None)[source]
Computes PRMS field mapping probability.
P(f|t) = P(t|C_f) P(f) / sum_{f'} (P(t|C_{f'}) P(f'))
Parameters:
  • t – str
  • coll_termfreq_fields – {field: freq, …}
Returns:

a dictionary {field: prms_prob, …}
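
The mapping probability is a Bayes-rule normalisation over fields. The sketch below is hypothetical and is not the module's actual source: `field_lengths` and `field_priors` are assumed inputs, P(t|C_f) is assumed to be the relative frequency of the term in the collection-level field, and returning zeros when the term occurs in no field is an assumed convention.

```python
def prms_mapping_probs(coll_termfreq_fields, field_lengths, field_priors):
    """P(f|t) proportional to P(t|C_f) * P(f), normalised over all fields."""
    joint = {}
    for f, tf in coll_termfreq_fields.items():
        len_C_f = field_lengths.get(f, 0)
        # Assumption: P(t|C_f) is the term's relative frequency in field f.
        p_t_f = tf / len_C_f if len_C_f > 0 else 0.0
        joint[f] = p_t_f * field_priors.get(f, 0.0)
    norm = sum(joint.values())
    # Assumed convention: a term that occurs in no field maps to all zeros.
    if norm == 0:
        return {f: 0.0 for f in joint}
    return {f: p / norm for f, p in joint.items()}


probs = prms_mapping_probs(
    coll_termfreq_fields={"title": 10, "body": 20},
    field_lengths={"title": 100, "body": 1000},
    field_priors={"title": 0.5, "body": 0.5},
)
```

In this example the term is relatively more frequent in the title field, so most of the mapping probability goes to title even though the raw count is higher in body.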

get_mapping_probs()[source]

Gets (cached) mapping probabilities for all query terms.

get_total_field_freq()[source]

Returns the total number of occurrences of all fields.

score_doc(doc_id)[source]

Scores the given document using the PRMS model.

Parameters:
  • doc_id – document id
  • lucene_doc_id – internal Lucene document ID
Returns:

float, PRMS score of document and query