In Proceedings of the VLDB Endowment 2021, Vol. 14, No. 13, ISSN 2150-8097.
Abstract
Over the past few years, we have witnessed the emergence of large knowledge graphs built by extracting and combining information from multiple sources. This has propelled many advances in query processing over knowledge graphs; however, the aspect of providing provenance explanations for query results has so far been mostly neglected. We therefore propose a novel method, SPARQLprov, based on query rewriting, to compute how-provenance polynomials for SPARQL queries over knowledge graphs. Contrary to existing works, SPARQLprov is system-agnostic and can be applied to standard and already deployed SPARQL engines without the need for customized extensions. We rely on spm-semirings to compute polynomial annotations that respect the property of commutation with homomorphisms on monotonic and non-monotonic SPARQL queries without aggregate functions. Our evaluation on real and synthetic data shows that SPARQLprov over standard engines incurs an acceptable runtime overhead w.r.t. the original query, competing with state-of-the-art solutions for how-provenance computation.
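To give an intuition of the rewriting (a simplified sketch with a hypothetical example vocabulary, not the exact rewriting produced by SPARQLprov), assume the data is reified with named graphs, i.e., every triple resides in its own named graph whose IRI serves as a statement identifier. A query can then be rewritten so that each answer also carries the identifiers of the statements it was derived from:

# Original query (hypothetical vocabulary)
PREFIX ex: <http://example.org/>
SELECT ?name WHERE {
  ?person ex:worksAt ex:AAU .
  ?person ex:name ?name .
}

# Illustrative provenance-aware rewriting over named graphs:
# ?g1 and ?g2 expose the identifiers of the matched statements.
SELECT ?name ?g1 ?g2 WHERE {
  GRAPH ?g1 { ?person ex:worksAt ex:AAU . }
  GRAPH ?g2 { ?person ex:name ?name . }
}

Multiplying the identifiers within an answer row (join) and summing over rows that produce the same answer (union and projection) yields a how-provenance polynomial such as g1·g2 + g3·g4.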
Cite
Hernández, Daniel, Luis Galárraga, and Katja Hose. "Computing how-provenance for SPARQL queries via query rewriting." Proceedings of the VLDB Endowment 14.13 (2021): 3389-3401.
EXPERIMENTS
Benchmarks and Competitors
We evaluated the viability of SPARQLprov to compute
provenance for SPARQL query answers by measuring the runtime
overhead incurred by the rewritten query w.r.t. the original
query on two standard RDF/SPARQL engines, namely Virtuoso and
Fuseki. Our evaluation consisted of several experiments, including
a comparison with GProM and TripleProv, two state-of-the-art solutions
to compute how-provenance for queries. Our experimental benchmarks comprise:
WatDiv: A state-of-the-art benchmark for SPARQL queries.
We used it to compare SPARQLprov against the competitors, as well as to evaluate the impact of the reification scheme on the overall runtime.
TPC-H: To evaluate
the scalability of SPARQLprov as the amount of
data increases, we used this benchmark to
generate datasets of different sizes.
Wikidata: We used this real-world dataset to evaluate the runtime overhead of SPARQLprov.
Experiments
Impact of reification: We evaluated SPARQLprov's runtime overhead on data encoded with three reification schemes, namely named graphs, standard RDF reification, and the Wikidata reification scheme (see the sketch after this list). This evaluation was conducted on the WatDiv benchmark.
Scalability evaluation: We studied the scalability of SPARQLprov using the TPC-H benchmark, from which we generated 8 synthetic datasets of increasing size.
Real data: We evaluated SPARQLprov on Wikidata, a real-world RDF knowledge graph.
Comparison with other systems: We compared SPARQLprov on the WatDiv benchmark against other solutions for how-provenance computation, namely TripleProv and GProM.
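As a point of reference for the reification experiment above, the same statement can be encoded under the three schemes roughly as follows (an illustrative sketch with made-up identifiers; prefix declarations omitted):

# Named graphs (TriG): the graph IRI ex:stmt1 identifies the statement
GRAPH ex:stmt1 { ex:alice ex:worksAt ex:AAU . }

# Standard RDF reification: a resource describes the triple's components
ex:stmt1 a rdf:Statement ;
         rdf:subject   ex:alice ;
         rdf:predicate ex:worksAt ;
         rdf:object    ex:AAU .

# Wikidata-style reification: the subject points to a statement node,
# which in turn points to the value
ex:alice ex:p_worksAt  ex:stmt1 .
ex:stmt1 ex:ps_worksAt ex:AAU .

The schemes differ in how many triples encode each statement, which can influence the cost of the rewritten provenance queries.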
System Settings
All the queries were executed on a machine with an AMD EPYC 7281 16-core processor, 256 GB of RAM, and an 8 TB HDD. We used Virtuoso (v. 7.2.5.1) and Fuseki (v. 3.17.0 with TDB1 as the storage backend) to test our approach. More details about the experimental settings can be found in our reproducibility archive.
For each experiment, we ran all queries once to warm up the engine and then ran each query load 5 times. For each query, we report the average response time over these executions.
Experimental Results
In our experiments, we compared our approach (with Virtuoso and Fuseki as backends) against two state-of-the-art approaches, TripleProv and GProM.
We additionally included 5 non-monotonic queries (O1-O5) involving OPTIONAL; a simplified example follows below. The list of queries is available under Downloads at the end of this page.
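For illustration, a minimal non-monotonic query in the spirit of these additions (not one of the actual O1-O5 templates) looks as follows:

PREFIX ex: <http://example.org/>
SELECT ?person ?email WHERE {
  ?person ex:follows ?other .
  # OPTIONAL keeps persons without an email, leaving ?email unbound
  OPTIONAL { ?person ex:email ?email . }
}

If a person has no email, the query returns a row with ?email unbound; adding an email triple to the data later removes that row in favor of a bound one, so the answer set does not grow monotonically. Annotating such answers is what requires the spm-semiring machinery mentioned in the abstract.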
Figure: WatDiv query execution times per reification scheme and query template. The top dotted line represents the timeout.
B: base query, R: reified query, P: provenance query.
Figure:
TPC-H execution times per query template.
Figure:
Wikidata time overhead.
Figure:
WatDiv query execution times per engine and query template. For our approach (denoted as Virtuoso) we report the results obtained with the named graphs reification scheme, and for GProM the results obtained with the vertical partitioning scheme. The top dotted line represents the timeout.
Source Code
The source code of SPARQLprov, as well as the utilities to reproduce the results of our experiments, can be found in our reproducibility archive.
Datasets and Queries
WatDiv: This benchmark uses a synthetic dataset described on the official WatDiv benchmark website. We used the 100 million triples dataset for our experiments.
TPC-H: This benchmark uses a synthetic dataset. The tool to create the dataset is available on the official TPC-H benchmark website. This tool generates tables encoded as tab-separated values. Our reproducibility scripts take care of translating these data into RDF; an illustrative sketch is given at the end of this section.
Wikidata: We used the Wikidata dump from January 27, 2020 from the Wikimedia Foundation. We have created a stable copy of the dataset for long-term preservation purposes.
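Coming back to the TPC-H entry above: for illustration, a row of the nation table could be translated into RDF along the following lines (a hypothetical mapping with made-up vocabulary, not necessarily the one implemented by our scripts):

# One nation row (key 1, name "ARGENTINA", region key 1) as RDF
PREFIX ex: <http://example.org/>
ex:nation1 a            ex:Nation ;
           ex:name      "ARGENTINA" ;
           ex:inRegion  ex:region1 .

In such a mapping, each table contributes one resource per row, with columns mapped to datatype properties and foreign keys to object properties.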