Star Pattern Fragments

Accessing Knowledge Graphs through Star Patterns

Abstract

Research paper

SPARQL endpoints offer access to a vast amount of interlinked information. While they offer a well-defined interfacefor efficiently retrieving results for complex SPARQL queries, complex query loads can easily overload or crash endpoints asall the computational load of answering the queries resides entirely with the server hosting the endpoint. Recently proposedinterfaces, such as Triple Pattern Fragments, have therefore shifted some of the query processing load from the server to theclient at the expense of increased network traffic in the case of non-selective triple patterns. This paper therefore proposes StarPattern Fragments (SPF), an RDF interface enabling a better load balancing between server and client by decomposing SPARQLqueries into star-shaped subqueries, evaluating them on the server side. Experiments using synthetic data (WatDiv), as well asreal data (DBpedia), show that SPF does not only significantly reduce network traffic, it is also up to two orders of magnitudefaster than the state-of-the-art interfaces under high query load.

EXPERIMENTS

To run the clients, a virtual machine (VM) running all 128 clients concurrently was used. The VM had 128 vCPU cores with a clock speed of 2.5GHz, 64KB L1 cache, 512KB L2 cache, 8192KB L3 cache, and 2TB main memory. Each client was limited to use just one vCPU core and 15GB RAM. The LDF server and the SPARQL endpoint were run, at all times, on a server with 32 vCPU cores, with a clock speed of 3GHz, 64KB L1 cache, 4096KB L2 cache, and 16384KB L3 cache, and a main memory of 128GB

We have the following metrics: We use WatDiv and DBpedia for data and queries for the experiments. We use query loads watdiv-1-star, watdiv-2-stars, watdiv-3-stars, watdiv-paths, watdiv-union (the union query load comprises all the queries in 1-star, 2-stars, 3-stars, and paths), watdiv-btt and dbpedia-lsq. The characteristics of the query loads over DBpedia and watdiv100M can be seen below.

Triple pattern count (#TP)

Join vertex count (#JV)

Join vertex degree (mean)

Result cardinality (#Results)

Triple pattern selectivity (mean)

Triple pattern selectivity (stdev)

We compared SPF to TPF, brTPF, SaGe, Smart-KG, and a SPARQL endpoint. For some configurations, the endpoint became unresponsive and the Smart-KG clients ran out of memory. The full list of these configurations can be seen below.

Endpoint: Unresponsive

watdiv10M
  • watdiv-3_stars: 128 clients
watdiv100M
  • watdiv-1_stars: 16 clients and above
  • watdiv-2_stars: 8 clients and above
  • watdiv-3_stars: 16 clients and above
  • watdiv-paths: 8 clients and above
  • watdiv-btt: 64 clients and above
watdiv1B
  • All configurations
watdiv10B
  • All configurations

Smart-KG: Clients out of memory

watdiv1B
  • watdiv-1_stars: 2 clients and above
  • watdiv-2_stars: 4 clients and above
  • watdiv-3_stars: 8 clients and above
We made experiments for the following size datasets: watdiv10M, watdiv100M, watdiv1B, watdiv10B, and dbpedia.

watdiv-1_stars

watdiv-2_stars

watdiv-3_stars

watdiv-paths

watdiv-union

watdiv-btt

watdiv-1_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-2_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-3_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-paths

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-union

Throughput over configurations (with timeouts)

Avg. server CPU load over configurations

Number of timeouts over configurations

watdiv-btt

Throughput over configurations (with timeouts)

Avg. server CPU load over configurations

Number of timeouts over configurations

Network Traffic

Number of Requests to the Server over query loads for 64 concurrent clients (w/o timeouts)

Number of Transferred Bytes over query loads for 64 concurrent clients (w/o timeouts)

Number of Requests to the Server over query loads for 64 concurrent clients (w/ timeouts)

Number of Transferred Bytes over query loads for 64 concurrent clients (w/ timeouts)

Performance

Query Execution Time over query loads for 64 concurrent clients

Query Response Time over query loads for 64 concurrent clients

watdiv-1_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-2_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-3_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-paths

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-union

Throughput over configurations (with timeouts)

Avg. server CPU load over configurations

Number of timeouts over configurations

watdiv-btt

Throughput over configurations (with timeouts)

Avg. server CPU load over configurations

Number of timeouts over configurations

Network Traffic

Number of Requests to the Server over query loads for 64 concurrent clients (w/o timeouts)

Number of Transferred Bytes over query loads for 64 concurrent clients (w/o timeouts)

Number of Requests to the Server over query loads for 64 concurrent clients (w/ timeouts)

Number of Transferred Bytes over query loads for 64 concurrent clients (w/ timeouts)

Performance

Query Execution Time over query loads for 64 concurrent clients

Query Response Time over query loads for 64 concurrent clients

watdiv-1_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-2_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-3_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-paths

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-union

Throughput over configurations (with timeouts)

Avg. server CPU load over configurations

Number of timeouts over configurations

watdiv-btt

Throughput over configurations (with timeouts)

Avg. server CPU load over configurations

Number of timeouts over configurations

Network Traffic

Number of Requests to the Server over query loads for 64 concurrent clients (w/o timeouts)

Number of Transferred Bytes over query loads for 64 concurrent clients (w/o timeouts)

Number of Requests to the Server over query loads for 64 concurrent clients (w/ timeouts)

Number of Transferred Bytes over query loads for 64 concurrent clients (w/ timeouts)

Performance

Query Execution Time over query loads for 64 concurrent clients

Query Response Time over query loads for 64 concurrent clients

watdiv-1_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-2_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-3_stars

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-paths

Throughput over configurations (with timeouts)

Number of timeouts over configurations

watdiv-union

Throughput over configurations (with timeouts)

Avg. server CPU load over configurations

Number of timeouts over configurations

watdiv-btt

Throughput over configurations (with timeouts)

Avg. server CPU load over configurations

Number of timeouts over configurations

Network Traffic

Number of Requests to the Server over query loads for 64 concurrent clients (w/o timeouts)

Number of Transferred Bytes over query loads for 64 concurrent clients (w/o timeouts)

Number of Requests to the Server over query loads for 64 concurrent clients (w/ timeouts)

Number of Transferred Bytes over query loads for 64 concurrent clients (w/ timeouts)

Performance

Query Execution Time over query loads for 64 concurrent clients

Query Response Time over query loads for 64 concurrent clients

Throughput over configurations

Avg. server CPU load over configurations

Number of timeouts over configurations

Network Traffic

Number of Requests to the Server over query loads for 64 concurrent clients (w/o timeouts)

Number of Transferred Bytes over query loads for 64 concurrent clients (w/o timeouts)

Number of Requests to the Server over query loads for 64 concurrent clients (w/ timeouts)

Number of Transferred Bytes over query loads for 64 concurrent clients (w/ timeouts)

Performance

Query Execution Time over query loads for 64 concurrent clients

Query Response Time over query loads for 64 concurrent clients

SPECIFICATION

SPF is an extension of brTPF, and thus also TPF. The hypermedia specification is therefore adapted from TPF found on this link. Expand below to view the hypermedia specification.

Definition

A Star Pattern Fragment is a Linked Data Fragment of a dataset that consists of three parts:

data
A Star Pattern Fragment MUST contain all triples of the dataset that match a given star pattern “?s ?p1 ?o1 . ?s ?p2 ?o2 . ... . ?s ?pn ?on”.
A Star Pattern Fragment MAY additionally contain other triples of the dataset.
metadata
A Star Pattern Fragment MUST contain a triple with a void:triples predicate that expresses the estimated total number of matches for the star pattern.
A Star Pattern Fragment MAY contain additional metadata.
hypermedia controls
A Star Pattern Fragment MUST contain hypermedia controls that allow retrieval of any other Star Pattern Fragment of the same dataset. This MUST be provided as a form that allows to choose subject or any predicate or object of the selector's star pattern.
A Star Pattern Fragment MAY contain additional hypermedia controls. In particular, the IRIs of the entities of the data, metadata, and control triples SHOULD be dereferenceable.

A Star Pattern Fragment of a dataset is fully determined by its star pattern selector. A star pattern selector consists of a subject component and an arbitrary number of predicate and object components. The number of predicate and object components MUST be the same. Henceforth we will refer to pi as the i'th predicate and oi as the i'th object. The subject MUST either be a variable or an IRI; all predicates MUST either be variables or IRIs; the objects MUST either be variables, IRIs, or literals. These components MUST NOT be blank nodes.

A Star Pattern Fragment is considered empty if it does not contain any data triples (regardless of metadata and controls).

Star Pattern Fragments are not bound to a specific syntax, since the data, metadata, and controls can be represented in different ways. The server MUST, however, support at least one RDF-based representation. Servers MUST indicate the corresponding MIME type when responding with a Star Pattern Fragment, so clients can correctly parse it.

The server SHOULD choose to group triples that combined answer the entire star pattern when answering answering a star pattern request so clients can more efficiently parse the results.

For RDF syntaxes without multiple graph support (such as Turtle or N-Triples), the data, metadata, and controls MUST necessarily be serialized to the same graph. For RDF syntaxes with multiple graph support (such as JSON-LD, TriG or N-Quads), the data MUST be serialized to the default graph and the metadata and controls MUST be serialized to one or multiple non-default graphs.

NOTE: A Star Pattern Fragments server MAY be backwards compatible with a Triple Pattern Fragments, and/or a Bindings-Restricted Triple Pattern Fragments server, in that if a star pattern with only one triple pattern is received, the server MAY choose to invoke the triple pattern selector or bindings-restricted triple pattern selector. In this case, the fragment is still obtained by issuing a star pattern fragments request, and thus it follows the same specification for hypermedia controls and metadata, i.e. the ones described in this document.

Data

The data of a Star Pattern Fragment is obtained by selecting all triples of a dataset that match the fragment's star pattern selector. These triples SHOULD be ordered in some consistent way, such that Star Pattern Fragments can be paged consistently. Data triples SHOULD be grouped such that each group of data triples represent a full answer to the star pattern. Data triples MUST NOT contain blank nodes, since these are scoped to a document, whereas clients need to be able to consistently refer to star components across different Star Pattern Fragments. The RECOMMENDED way of eliminating blank nodes is skolemization.

If the RDF syntax supports multiple graphs, data triples MUST be serialized to the default graph.

Metadata

Each Star Pattern Fragment, and each page of a Star Pattern Fragment, MUST contain the estimated total number of results (grouped triple patterns that answer the star pattern) that match the fragment's selector. This MUST be expressed as a triple with the following components:

subject
the IRI of the fragment
predicate
the void:triples predicate
object
an integer literal expressing the estimated total number of matching results

The estimate MUST be a non-negative, finite, integer number with the following properties:

  • If no stars match the selector, the estimate MUST be zero.
  • If one or more stars match the selector, the estimate MUST NOT be zero.
  • The estimate SHOULD be as close as possible to the actual number of matches.
  • If the number of matching stars is smaller than the number of items per page, the estimate SHOULD be exactly the number of matching stars.

The metadata MAY additionally contain variations of the above triple. For instance, it is RECOMMENDED to add a triple with the same subject and object and the hydra:totalItems predicate.

If the RDF syntax supports multiple graphs, metadata triples MUST be serialized to a non-default graph. This non-default graph SHOULD be explicitly related to the Star Pattern Fragment (for instance, using the foaf:primaryTopic predicate), so clients can interpret what resource this metadata belongs to. This graph MAY be the same as the graph containing the hypermedia controls.

Hypermedia controls

Each Star Pattern Fragment, and each page of a Star Pattern Fragment, MUST contain a hypermedia control that can generate the IRI of each other Star Pattern Fragment of the same dataset.

This control MUST act as a function that accepts the following parameters:

subject
A variable or a constant IRI
triples
A positive integer that describes the number og triple patterns in the star pattern
stars
n grouped predicates/objects, where n is the number of triple patterns in the star pattern; predicates are variables or constant IRIs, objects are variables or constant IRIs or literals
It MUST then map these parameters to the IRI of the dataset's Star Pattern Fragment whose selector has the given parameter values.

This control MUST be expressed as a form in the Hydra Core Vocabulary using triples with the following structure:


<http://example.org/dbpedia#dataset>
	a
	void:Dataset , hydra:Collection ;
	void:subset <http://example.org/dbpedia> ;
	hydra:search   [ hydra:mapping  [ hydra:property rdf:subject ; hydra:variable "s" ] ;
			 hydra:mapping  [ hydra:property xsd:integer ; hydra:variable "triples" ] ;
			 hydra:mapping  [ hydra:property xsd:string ; hydra:variable "star" ] ;
			 hydra:mapping  [ hydra:property xsd:string ; hydra:variable "values" ] ;
			 hydra:template "http://example.org/dbpedia{?s,triples,star,values}" ] .

The IRI template to retrieve Star Pattern Fragments of the dataset is http://example.org/dbpedia{?s,triples,star,values}. All of these MUST be adjusted to fit the configuration of a specific server. Note that the form MUST be attached to the dataset, as the form filters the dataset and not the fragment, and the fragment MUST explicitly be listed as a subset of the dataset, in order to indicate the relation between the two.

Paging

Star Pattern Fragments SHOULD be paged to avoid overly large responses. A page of a Star Pattern Fragment consists of the following three parts:

data / selector
The page MUST contain a subset of data of the corresponding Star Pattern Fragment. The data of a fragment SHOULD be distributed over pages of a given page size n by dividing an ordered list of matching triples in chunks of size n.
Each star SHOULD only occur on one page of any given fragment.
The page MAY additionally contain other stars of the dataset.
metadata
The page MUST be linked to its corresponding Star Pattern Fragment using hydra:view.
The page SHOULD contain all metadata of the fragment. In particular, it MUST contain a triple with a void:triples predicate that expresses the estimated total number of matches for the fragment's star pattern.
The page MAY contain additional metadata.
hypermedia controls
A page SHOULD contain all hypermedia controls of the fragment. In particular, it MUST contain those controls that allow retrieval of any Star Pattern Fragment of the dataset.
If a previous page directly precedes the page, this page MUST link to it using hydra:previous. The previous page SHOULD NOT be empty.
If a next page directly follows the page, this page MUST link to it using hydra:next. The next page SHOULD NOT be empty.
The page MAY contain additional hypermedia controls.

A page is considered empty if it does not contain any data triples (regardless of metadata and controls).

Here is an the IRI template of a Star Pattern Fragment:

http://example.org/dbpedia{?s,triples,star,values}

Here is an example of an IRI for a Star Pattern Fragment:

http://example.org/dbpedia?triples=3&star=[p1,dbo:country;o1,dbr:norway;p2,dbo:award;p3,dbo:birthDate]

INSTALLATION

The SPF client is available in Node.js or Java 8, and the server is available in Java 8. We extended the TPF server to support SPF requests. You can view the sources on GitHub. Note, that it was implemented only as a prototype for testing the performance and availability of the server and client. The server can be run as a Jetty application. The client can be run as a standalone .js or .jar file or installed into your own projects as a library.

Requirements

Java 8 or newer.

Installation

  1. Get and build the sources from GitHub.
  2. Include the SPF file as a library to your project.
  3. Use SparqlQueryProcessor class to process queries using the SPF client.

DOWNLOADS