SHapes ExtrActor from very large Knowledge Graphs


We demonstrate SHACTOR, a system for extracting and analyzing validating shapes from very large Knowledge Graphs (KGs). Shapes represent a specific form of data patterns, akin to schemas for entities. Standard shape extraction approaches are likely to produce thousands of shapes, some of which represent spurious constraints extracted because of erroneous data in the KG. Given a KG with tens of millions of triples and thousands of classes, SHACTOR parses the KG using our efficient and scalable shapes extraction algorithm and outputs SHACL shape constraints. The extracted shapes are further annotated with statistical information about their support in the graph, which makes it possible to identify both erroneous and missing triples in the KG. Hence, SHACTOR can be used to extract, analyze, and clean shape constraints from very large KGs. Furthermore, it also enables the user to find and correct errors: it automatically generates SPARQL queries over the graph that retrieve the nodes and facts responsible for the spurious shapes, so that the user can intervene by amending the data.
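To illustrate the kind of SPARQL query mentioned above, here is a minimal sketch that builds a query retrieving the entities behind a suspicious constraint. The function name `build_lookup_query`, its parameters, and the example IRIs are assumptions made for illustration; they are not taken from SHACTOR's code, and the queries SHACTOR actually generates may look different.

```python
def build_lookup_query(target_class, prop, object_class, limit=100):
    """Build a SPARQL query listing entities of `target_class` whose
    `prop` value is typed as `object_class`.

    Hypothetical helper: all arguments are full IRIs given as strings.
    """
    return (
        "SELECT DISTINCT ?entity ?value WHERE {\n"
        f"  ?entity a <{target_class}> ;\n"
        f"          <{prop}> ?value .\n"
        f"  ?value a <{object_class}> .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Example IRIs are placeholders, not findings from a real KG.
query = build_lookup_query(
    "http://example.org/Person",
    "http://example.org/worksFor",
    "http://example.org/Dog",
    limit=10,
)
print(query)
```

Running the resulting query against a SPARQL endpoint would return the entities and values that produced the constraint, which a user could then inspect and correct.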


Rabbani, Kashif; Lissandrini, Matteo; and Hose, Katja. SHACTOR: Improving the Quality of Large-Scale Knowledge Graphs with Validating Shapes. In Proceedings of the 2023 International Conference on Management of Data, (SIGMOD-Companion '23), June 18-23, 2023, Seattle, WA, USA.

SHACTOR's Presentation and Demonstration video

--With a use case on the real-world DBpedia Knowledge Graph--

Knowledge graphs (KGs) are created by curating data from heterogeneous sources, and analytics are then performed over them by executing SPARQL queries. To validate the data in a KG and prevent the insertion of erroneous data, validating shapes can be defined in a language such as SHACL. Defining them manually is the most reliable option, but KGs are so large that this requires too much effort. A common alternative is to infer shape constraints from the data, but this introduces a circular dependency: the shapes meant to validate the KG are derived by trusting the KG's own content. Therefore, if we know that the KG contains errors, we need a way to extract reliable shapes from the noisy graph. We present SHACTOR as a solution to this problem. It uses our scalable shapes extraction approach to extract SHACL shapes from very large KGs, ensuring the quality of the output shapes through the concepts of support and confidence.
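The idea of filtering shapes by support and confidence can be sketched as follows. This is a simplified illustration and not SHACTOR's algorithm. It assumes that the support of a candidate constraint (class, property, object class) is the number of entities of that class satisfying it, and that confidence is support divided by the number of entities of the class; the precise definitions are given in the paper. The function names, the `UNTYPED` label, and the toy triples are all hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
UNTYPED = "UNTYPED"  # assumed label for literals and untyped objects


def shape_statistics(triples):
    """Return {(class, property, object_class): (support, confidence)}."""
    # First pass: record the classes of every typed entity.
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    # Number of entities per class.
    class_count = defaultdict(int)
    for classes in types.values():
        for c in classes:
            class_count[c] += 1

    # Second pass: collect the entities supporting each candidate constraint.
    supporters = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        object_classes = types.get(o, {UNTYPED})
        for c in types.get(s, ()):
            for oc in object_classes:
                supporters[(c, p, oc)].add(s)

    return {
        key: (len(ents), len(ents) / class_count[key[0]])
        for key, ents in supporters.items()
    }


def prune(stats, min_support, min_confidence):
    """Keep only constraints that meet both thresholds."""
    return {
        key: (sup, conf)
        for key, (sup, conf) in stats.items()
        if sup >= min_support and conf >= min_confidence
    }


# Toy graph: the last triple is a deliberately erroneous fact.
example_triples = [
    ("alice", RDF_TYPE, "Person"),
    ("bob", RDF_TYPE, "Person"),
    ("carol", RDF_TYPE, "Person"),
    ("acme", RDF_TYPE, "Company"),
    ("rex", RDF_TYPE, "Dog"),
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("carol", "worksFor", "rex"),
]

stats = shape_statistics(example_triples)
reliable = prune(stats, min_support=2, min_confidence=0.5)
```

In this toy graph, the constraint that a Person works for a Company is supported by two of the three Person entities, while the constraint that a Person works for a Dog is supported by only one, so the thresholds above keep the first and drop the second.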

Click here to watch the video on ACM's official website

Source Code (Open Source)

Visit our GitHub repository.