Knowledge graphs have become vital sources for semantic search and provide users with precise answers to their information needs. Knowledge graphs often consist of billions of facts, typically encoded in the form of RDF triples. In most cases, these facts are extracted automatically and can thus be susceptible to errors. For many applications, it can therefore be very useful to complement knowledge graph facts with textual evidence. For instance, it can help users make informed decisions about the validity of the facts that are returned as part of an answer to a query. It can also help users find contextual information about the facts that goes beyond the information contained in the knowledge graph itself. In this paper, we therefore propose FacTify, an approach that, given a knowledge graph and a text corpus, retrieves the top-k most relevant textual passages for a given set of facts. Since our goal is to retrieve short passages, we develop a set of IR models combining exact matching through the Okapi BM25 model with semantic matching using word embeddings. To evaluate our approach, we build an extensive benchmark consisting of facts extracted from a large knowledge graph (YAGO) and text passages retrieved from a large text corpus (Wikipedia). Our benchmark has been assessed through crowdsourcing and is publicly available. Our experimental results demonstrate the effectiveness of our approach in retrieving textual evidence for knowledge graph facts, compared to many baseline approaches.
Knowledge subgraphs (i.e., queries) are extracted from YAGO. You can find the queries in the file queries.csv. The file is comma-separated, with the fields qid, triple, and Keywords. If a query consists of multiple triples, the triples are separated by the ";" character. The final column, Keywords, gives the query in natural-language form.
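As a minimal sketch of how the queries file could be loaded, the following Python snippet parses the three fields and splits multi-triple queries on ";". The sample rows are placeholders, not real YAGO facts, and the presence of a header row with exactly these column names is an assumption; adjust the reader if queries.csv is laid out differently.

```python
import csv
import io

# Placeholder rows in the described layout (qid, triple, Keywords).
# Assumption: the file starts with a header row using these column names.
SAMPLE = (
    "qid,triple,Keywords\n"
    "1,<s1> <p1> <o1>,first query in natural language\n"
    "2,<s1> <p1> <o1>;<o1> <p2> <o2>,second query in natural language\n"
)


def read_queries(handle):
    """Yield (qid, list_of_triples, keywords) for each row of the file."""
    for row in csv.DictReader(handle):
        # A query with several triples separates them with ";".
        triples = [t.strip() for t in row["triple"].split(";") if t.strip()]
        yield row["qid"], triples, row["Keywords"]


queries = list(read_queries(io.StringIO(SAMPLE)))
for qid, triples, keywords in queries:
    print(qid, triples, keywords)
```

To read the real file, pass `open("queries.csv", newline="", encoding="utf-8")` instead of the in-memory sample.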
The text corpus is used to retrieve textual passages for the queries. Passages are extracted from the Wikipedia dump of 2018-08-01 by building overlapping passages of 3 consecutive sentences. Sentences are detected using Stanford CoreNLP version 3.9.1. The segmented corpus is available for download; the archive (compressed with 7zip) contains two files, wiki.text and passids.txt. The first file contains the text of the passages. The second file indexes them: for each passage it stores the passage id, the start offset of the passage in wiki.text, and the end offset. Example code for reading the passages is provided; the Python code below shows how to print all passages to standard output.
from passage_reader import PassageReader

# Buffer size is the number of passages to store in memory; this amounts to roughly 300MB
passReader = PassageReader("wiki.text", "passids.txt", buffer_size=1000000)
with passReader as preader:
    for (passid, text) in preader:
        print(passid)
        print(text)
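For readers who want to segment their own text in a comparable way, the overlapping 3-sentence windows can be sketched as below. This is only an illustration, not the code used to build the corpus: it takes sentences that are already split (the corpus itself relies on Stanford CoreNLP for that), and both the one-sentence stride and the `title#index` id scheme are assumptions based on the sample passage ids shown in this document. The function name `build_passages` is hypothetical.

```python
def build_passages(title, sentences, window=3):
    """Return (passage_id, text) pairs for overlapping sentence windows."""
    if not sentences:
        return []
    if len(sentences) < window:
        # Assumption: a very short article yields a single passage.
        return [(f"{title}#0", " ".join(sentences))]
    passages = []
    # Slide the window one sentence at a time so that neighbouring
    # passages share window - 1 sentences.
    for start in range(len(sentences) - window + 1):
        text = " ".join(sentences[start:start + window])
        passages.append((f"{title}#{start}", text))
    return passages


sentences = ["S0.", "S1.", "S2.", "S3.", "S4."]
passages = build_passages("Example article", sentences)
for passage_id, text in passages:
    print(passage_id, text)
```

Five sentences give three passages here, each sharing two sentences with its neighbour.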
A new algorithm targeting this task can use the queries and the passages to produce ranked results for each query. To use the provided evaluation script, write the output as a tab-separated file containing the query id, passage id, passage text, and relevance score. The file should be sorted by query id and then by relevance score; in the sample output that follows, the scores within a query appear in descending order.
1	Nancy Lincoln#31	They had three children: Sarah Lincoln (February 10, 1807 - January 20, 1828). Abraham Lincoln (February 12, 1809 - April 15, 1865). Thomas Lincoln, Jr. (died in infancy, 1812).	16.07022454208584
1	Nancy Lincoln#32	Abraham Lincoln (February 12, 1809 - April 15, 1865). Thomas Lincoln, Jr. (died in infancy, 1812). The young family lived in what was then Hardin County, Kentucky (now LaRue).	15.230758500842862
1	Nancy Lincoln#30	A record of their marriage license is held at the county courthouse. They had three children: Sarah Lincoln (February 10, 1807 - January 20, 1828). Abraham Lincoln (February 12, 1809 - April 15, 1865).	14.34208330071841
1	1865 in the United States#99	April 15 - Abraham Lincoln, 16th President of the United States from 1861 to 1865 (born 1809). April 26 - John Wilkes Booth, actor and assassin of Abraham Lincoln (born 1838). May 20 - William K. Sebastian, U.S. Senator from Arkansas from 1848 to 1861 (born 1812).	14.306765387697894
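A results file in this layout could be produced along the following lines. This is a sketch under assumptions: query ids are treated as integers, scores are sorted highest first within each query (as in the sample above), and tabs or newlines inside a passage are collapsed to single spaces so that they cannot break the column structure. The function name `write_results` and the sample rows are hypothetical.

```python
import io


def write_results(results, handle):
    """Write (qid, passage_id, text, score) tuples as tab-separated lines."""
    # Sort by query id, then by descending score within each query.
    ordered = sorted(results, key=lambda r: (r[0], -r[3]))
    for qid, passage_id, text, score in ordered:
        # Collapse any whitespace run (including tabs and newlines)
        # into a single space to keep exactly four columns per line.
        clean_text = " ".join(text.split())
        handle.write(f"{qid}\t{passage_id}\t{clean_text}\t{score}\n")


results = [
    (2, "Doc B#0", "text b", 1.5),
    (1, "Doc A#3", "text\twith tab", 0.7),
    (1, "Doc A#4", "other text", 2.25),
]
buffer = io.StringIO()
write_results(results, buffer)
print(buffer.getvalue(), end="")
```

To write to disk, replace the `io.StringIO` buffer with a file opened in text mode with UTF-8 encoding.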
The evaluation script creates a spreadsheet file containing NDCG, MRR, and Precision values. The results reported in the article can be reproduced with the script reproduce.sh, which automatically downloads the required files and runs the evaluation.
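For orientation, the three metrics can be sketched with their textbook definitions as below. This is not the evaluation script itself: the cutoffs, the gain function, and how crowdsourced judgments map to relevance grades are not stated here, so the choices in this sketch (linear gain, log2 discount, any grade above zero counted as relevant) are assumptions. Each function takes the relevance grades of one query's results in ranked order.

```python
import math


def precision_at_k(relevances, k):
    """Fraction of the top-k results with a relevance grade above zero."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def reciprocal_rank(relevances):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, r in enumerate(relevances, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings):
    """MRR: the reciprocal rank averaged over all queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(r / math.log2(rank + 1)
               for rank, r in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal reordering of the same list."""
    # Simplification: the ideal ranking is built from the retrieved list
    # only, not from every judged passage for the query.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


ranked = [0, 1, 1, 0]  # hypothetical grades for one query
print(precision_at_k(ranked, 4))
print(reciprocal_rank(ranked))
print(ndcg_at_k(ranked, 4))
```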