RDFShapes

Query Runtime: Execution time of a query in seconds.
Q-error: Ratio between estimated and true result cardinality
Cost of the query plan: Sum of all the intermediate results

Optimizing SPARQL Queries using Shape Statistics

Kashif Rabbani, Matteo Lissandrini, and Katja Hose.

Proceedings of the 24th International Conference on Extending Database Technology, EDBT 2021 (PDF), Publisher Link

Abstact

With the growing popularity of storing data in native RDF, we witness more and more diverse use cases with complex SPARQL queries. As a consequence, query optimization - and in particular cardinality estimation and join ordering - becomes even more crucial. Classical methods exploit global statistics covering the entire RDF graph as a whole, which naturally fails to correctly capture correlations that are very common in RDF datasets, which then leads to erroneous cardinality estimations and suboptimal query execution plans. The alternative of trying to capture correlations in a fine-granular manner, on the other hand, results in very costly preprocessing steps to create these statistics. Hence, in this paper we propose shapes statistics, which extend the recent SHACL standard with statistic information to capture the correlation between classes and properties. Our extensive experiments on synthetic and real data show that shapes statistics can be generated and managed with only little overhead without disadvantages in query runtime while leading to noticeable improvements in cardinality estimation.

EXPERIMENTS

We experimented on LUBM, YAGO-4 and WATDIV datasets. All experiments are performed on a single machine with Ubuntu 18.04, having 16 cores, 1TB HDD and 256GB RAM.

We have the following metrics to evaluate our approach against others: