Data components

Data sources

DBpedia

We use DBpedia as the main underlying knowledge base. In particular, we prepared dumps for DBpedia version 2015-10.

DBpedia is distributed, among other formats, as a set of .ttl.bz2 files. We use a selection of these files, as defined in data/config/dbpedia2mongo.config.json. You can download these files directly from the DBpedia website or by running ./scripts/download_dbpedia.sh from the main Nordlys folder. Running the script places the downloaded files under data/raw-data/dbpedia-2015-10/.

We also provide a minimal sample of DBpedia under data/dbpedia-2015-10-sample, which can be used for testing and development in a local environment.

FACC

The Freebase Annotations of the ClueWeb Corpora (FACC) is used for building the entity surface form dictionary. You can download the collection from its main website and further process it using our scripts; see the sketch below for the general idea. Alternatively, you can download the preprocessed data from our server. Check the README file under data/raw-data/facc for detailed information.
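The snippet below sketches, in broad strokes, how a surface form dictionary can be derived from FACC annotations: it counts how often each mention string is linked to each entity. The tab-separated column layout (mention text in the third field, Freebase ID in the last) is an assumption based on the FACC1 format; our actual processing scripts may differ.

from collections import defaultdict

def build_surface_forms(facc_file):
    """Count (surface form -> entity) annotation frequencies.

    The column layout is assumed from the FACC1 format; adjust the
    field indices to match the files you downloaded.
    """
    surface_forms = defaultdict(lambda: defaultdict(int))
    with open(facc_file, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # skip malformed lines
            mention, freebase_id = fields[2], fields[-1]
            surface_forms[mention.lower()][freebase_id] += 1
    return surface_forms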

Word2Vec

We use 300-dimensional Word2Vec vectors trained on the Google News corpus, which can be downloaded from the Word2Vec website. Check the README file under data/raw-data/word2vec for detailed information.
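If you want to inspect the vectors, the snippet below shows one way to load them with gensim. The file name is the standard one used for the Google News distribution; adjust the path to wherever you placed the download (we assume data/raw-data/word2vec/).

from gensim.models import KeyedVectors

# Load the pre-trained, binary-format Google News vectors (300 dimensions).
vectors = KeyedVectors.load_word2vec_format(
    "data/raw-data/word2vec/GoogleNews-vectors-negative300.bin.gz",
    binary=True,
)

# Sanity check: nearest neighbors of a word in the embedding space.
print(vectors.most_similar("query", topn=5))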

MongoDB collections

The table below provides an overview of the MongoDB collections that are used by the different services.

Name                    Description                            EC   ER   EL   TTI
dbpedia-2015-10         DBpedia                                +1   +2        +3
fb2dbp-2015-10          Mapping from Freebase to DBpedia IDs   +4        +
surface_forms_dbpedia   Entity surface forms from DBpedia      +5        +6
surface_forms_facc      Entity surface forms from FACC         +7        +
word2vec-googlenews     Word2vec trained on Google News                       +8
  • 1 for entity ID-based lookup and DBpedia2Freebase mapping functionalities
  • 2 only for building the Elastic entity index; not used later in the retrieval process
  • 3 for entity-centric TTI method
  • 4 for Freebase2DBpedia mapping functionality
  • 5 for entity surface form lookup from DBpedia
  • 6 for all EL methods other than “commonness”
  • 7 for entity surface form lookup from FACC
  • 8 for LTR TTI method
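For illustration, the snippet below sketches an entity ID-based lookup against the dbpedia-2015-10 collection using pymongo. The database name ("nordlys") and the prefixed-URI key format are assumptions; check your Nordlys configuration and the stored documents for the exact conventions.

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["nordlys"]["dbpedia-2015-10"]

# Look up a single entity by its (assumed) prefixed-URI document ID.
doc = collection.find_one({"_id": "<dbpedia:Audi_A4>"})
if doc is not None:
    print(sorted(doc.keys()))  # inspect which predicates are stored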

Building MongoDB sources from raw data

To build the above MongoDB collections from raw data (as opposed to using the provided dumps), first make sure that you have the raw data files.

  • For DBpedia, these may be downloaded using ./scripts/download_dbpedia.sh
  • For the FACC and Word2vec data files, execute ./scripts/download_raw.sh

To load DBpedia into MongoDB, run

python -m nordlys.core.data.dbpedia.dbpedia2mongo data/config/dbpedia-2015-10/dbpedia2mongo.config.json

Note

To use the DBpedia 2015-10 sample shipped with Nordlys, as opposed to the full collection, change the value of path to data/raw-data/dbpedia-2015-10_sample/ in dbpedia2mongo.config.json.
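One way to do this without editing the file by hand is to rewrite the path value programmatically. The sketch below assumes only that the config is a JSON object with a top-level path key, as described above, and leaves all other fields untouched.

import json

config_file = "data/config/dbpedia-2015-10/dbpedia2mongo.config.json"
with open(config_file) as f:
    config = json.load(f)

# Point the loader at the shipped sample instead of the full collection.
config["path"] = "data/raw-data/dbpedia-2015-10_sample/"

with open(config_file, "w") as f:
    json.dump(config, f, indent=2)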

Elastic indices

The table below provides an overview of the Elastic indices that are used by the different services.

Name                    Description              ER   EL   TTI
dbpedia_2015_10         DBpedia index            +    +1   +2
dbpedia_2015_10_uri     DBpedia URI-only index   +3
dbpedia_2015_10_types   DBpedia types index                +4
  • 1 for all EL methods other than “commonness”
  • 2 only for entity-centric TTI method
  • 3 only for ELR entity ranking method
  • 4 only for type-centric TTI method
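As an illustration, the snippet below runs a simple full-text query against the dbpedia_2015_10 index with the official Elasticsearch Python client. The queried field name ("catchall") is an assumption; check the mappings of the indices Nordlys builds for the actual field names.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Match query against an assumed all-content field; returns the top 10 hits.
res = es.search(
    index="dbpedia_2015_10",
    body={"query": {"match": {"catchall": "total recall movie"}}, "size": 10},
)
for hit in res["hits"]["hits"]:
    print(hit["_id"], hit["_score"])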