A tidyverse lover’s intro to RDF

Carl Boettiger

2024-08-05

In the world of data science, RDF is a bit of an ugly duckling. Like XML and Java, only without the massive-adoption-that-refuses-to-die part. In fact RDF is most frequently expressed in XML, and RDF tools are written in Java, which help give RDF has the aesthetics of steampunk, of some technology for some futuristic Semantic Web1 in a toolset that feels about as lightweight and modern as iron dreadnought.

But don’t let these appearances deceive you. RDF really is cool. If you’ve ever gotten carried away using tidyr::gather to make everything into one long table, you may have noticed you can just about always get things down to about three columns, as we see with an obligatory mtcars data example for tidyr::gather:

library(rdflib)
library(dplyr)
library(tidyr)
library(tibble)
library(jsonld)
car_triples <- 
mtcars %>% 
  rownames_to_column("Model") %>% 
  gather(attribute,measurement, -Model)

If you like long tables like this, RDF is for you. This layout isn’t “Tidy Data,” where rows are observations and columns are variables, but it is damn useful sometimes. This format is very liquid, easy to reshape into other structures – so much so that tidyr::gather was originally known as melt in the reshape2 package. It’s also a good way to get started thinking about RDF.

It’s all about the triples

Looking at this table closely, we see that each row is reduced to the most elementary statement you can make from the data. A row no longer tells you the measurements (observations) all attributes (variables) of a given species (key), instead, you get just one fact per row, Mazda RX4 gets a mpg measurement of 21.0. In RDF-world, we think of these three-part statements as something very special, which we call triples. RDF is all about these triples.

The first column came from the row names in this case, the Model of car. This acts serves as a key to index the data.frame, i.e. the subject being described. The next column is the variable (also called attribute or property) being measured, (that is, column names, other than the key column(s), from the tidy data), called the property or predicate in RDF-speak (slash grammar-school jargon). The third column is the actual value measured, more object of the predicate. Call it key-property-value or subject-predicate-object, these are our triples. We can represent just about any data in fully elementary manner.

Table 1: The many names for triples.
       
RDF subject predicate object
JSON object property value
spreadsheet row id column name cell
data.frame key variable measurement
data.frame key attribute value

Table 1 summarizes the many different names associated with triples. The first naming convention is the terminology typically associated with RDF. The second set are terms typically associated with JSON data, while the remaining are all examples in tabular or relational data structures.

Subject URIs

Using row names as our subject was intuitive but actually a bit sloppy. tidyverse lovers know that tidyverse doesn’t like rownames, they aren’t tidy and have a way of causing trouble. Of course, we made rownames into a proper column to use gather, but we could have taken this one step further. In true tidyverse fashion, this rownames-column is really just one more variable we can observe, one more attribute of the thing we were describing: say, thing A (Car A) is a car_model_name as Mazda RX4 and thing A also has mpg of 21. We can accomplish such a greater level of abstraction by keeping the Model as just another variable the row ids themselves as the key (i.e. the subject) of our triple:

car_triples <- 
mtcars %>% 
  rownames_to_column("Model") %>% 
  rowid_to_column("subject") %>% 
  gather(predicate, object, -subject)

This is identical to a gather of all columns, where we have just made the original row ids an explicit column for reference (diligent reader will recognize we would need this information to reverse the operation and spread the data back into it’s wide form; without it, our transformation is lossy and irreversible). Our subject column now consists only of simple numeric id’s, while we have gained an additional triple for every row in the original data which states Model of each id number (e.g. 1 is Model Mazda RX4). Okay, now you’re probably thinking: “wait a minute, 1 is not a very unique or specific key, surely that will cause trouble,” and you’d be right. For instance, if we performed the same transformation on the iris data, we get triples in the exact same format, ready to bind_rows:

iris_triples <- iris %>%
  rowid_to_column("subject") %>%
  gather(key = predicate, value = object, -subject)

but in the iris data, 1 corresponds to the first individual Iris flower in the measurement data, and not a Mazda RX4. If we don’t want to get confused, we’re going to need to make sure our identifiers are unique: not just kind of unique, but unique in the World wide. And what else is unique world-wide? Yup, you guessed it, we are going to use URLs for our subject identifiers, just like the world wide web. Think of this as a clever out-sourcing to the whole internet domain registry service. Here, we’ll imagine registering each of these example datasets with a separate base URL, so instead of a vague 1 to identify the first observation in the iris example data, we’ll use the URL http://example.com/iris#1, which we can now distinguish from http://example.com/mtcars#1 (and if you’re way ahead of me, yes, we’ll have more to say about URI vs URL and the use of blank nodes in just a minute). For example:

iris_triples <- iris %>%
  rowid_to_column("subject") %>%
  mutate(subject = paste0("http://example.com/", "iris#", subject)) %>%
  gather(key = predicate, value = object, -subject)

Predicate URIs

A slightly more subtle version of the same problem can arise with our predicates. Different tables may use the same attribute (i.e. originally, a column name of a variable) for different things – the attribute labeled cyl means “number of cylinders” in mtcars data.frame, but could mean something very different in different data. Luckily we’ve already seen how to make names unique in RDF turn them into URLs.

iris_triples <- iris %>%
  rowid_to_column("subject") %>%
  mutate(subject = paste0("http://example.com/", "iris#", subject)) %>%
  gather(key = predicate, value = object, -subject) %>%
  mutate(predicate = paste0("http://example.com/", "iris#", predicate))

At this point the motivation for the name “Linked Data” is probably becoming painfully obvious.

Datatype URIs

One more column to go! But wait a minute, the object column is different, isn’t it? These measurements don’t suffer from the same ambiguity – after all, there is no confusion if a car has 4 cylinders and an iris has 4 mm long sepals. However, a new issue has arisen in the data type (e.g. string, boolean, double, integer, dateTime, etc). A close look reveals our object column is encoded as a character and not numeric – how’d that happen? tidyr::gather has coerced the whole column into character strings because some of the values, that is, the Species names in iris and the Model names in mtcars, are text strings (and it couldn’t exactly coerce them into integers). Perhaps this isn’t a big deal – we can often guess the type of an object just by how it looks (so-called Duck typing, because if it quacks like duck…). Still, being explicit about data types is a Good Thing, so fortunately there’s an explicit way to address this too … oh no … not … yes … more URLs!

Luckily we don’t have to make up example.com URLs this time because there’s already a well-established list of data types widely used across the internet that were originally developed for use in XML (I warned you) Schemas, listed in see the W3C RDF DataTypes. As the standard shows, familiar types string, double, boolean, integer, etc are made explicit using the XML Schema URL: http://www.w3.org/2001/XMLSchema#, followed by the type; so an integer would be `http://www.w3.org/2001/XMLSchema#integer, a character string http://www.w3.org/2001/XMLSchema#string etc.

Because this case is a little different, the URL is attached directly after the object value, which is set off by quotes, using the symbol ^^ (I dunno, but I think two duck feet), such that 5.1 becomes "5.1"^^http://www.w3.org/2001/XMLSchema#double. Wow2. Most of the time we won’t have to worry about the type, because, if it quacks…

Triples in rdflib

So far, we have explored the concept of triples using familiar data.frame structures, but haven’t yet introduced any rdflib functions. Though we’ve been thinking of RDF data in this explicitly tabular three-column structure, that is really just one potentially convenient representation. Just as the same tabular data can be represented in a data.frame, written to disk as a .csv file, or stored in a database (like MySQL or PostgreSQL), so it is for RDF to even greater degree. We have separate abstractions for the information itself compared to how it is represented.

To take advantage of this abstraction, rdflib introduces an rdf class object. Depending on how this is initialized, this could utilize storage in memory (the default), on disk, or potentially in an array of different databases, (including relational databases like PostgreSQL and rdf-specific ones like Virtuoso, depending on how the underlying redland library is compiled – a topic beyond our scope here). Here, we simply initialize an rdf object using the default in-memory storage:

rdf <- rdf()

To add triples to this rdf object (often called an RDF Model or RDF Graph), we use the function rdf_add, which takes a subject, predicate, and object as arguments, as we have just discussed. A datatype URI can be inferred from the R type used for the object (e.g. numeric, integer, logical, character, etc.)

base <- paste0("http://example.com/", "iris#")

rdf %>% 
  rdf_add(subject = paste0(base, "obs1"), 
          predicate = paste0(base, "Sepal.Length"), 
          object = 5.1)

rdf

The result is displayed as a triple discussed above. This is technically an example of the nquad notation we will see later. Note the inferred datatype URI.

Dialing back the ugly

This gather thing started well, but now are data is looking pretty ugly, not to mention cumbersome. You have some idea why RDF hasn’t taken data science by storm, and we haven’t even looked at how ugly this gets when you write it in the RDF/XML serialization yet! On the upside, we’ve introduced most of the essential concepts that will let us start to work with data as triples. Before we proceed further, we’ll take a quick look at some of the options for expressing triples in different ways, and also introduce some of the different serializations (ways of representing in text) frequently used to express these triples.

Prefixes for URIs

Long URL strings are one of the most obvious ways that what started off looking like a concise, minimal statement got ugly and cumbersome. Borrowing from the notion of Namespaces in XML, most RDF tools permit custom prefixes to be declared and swapped in for longer URLs. A prefix is typically a short string3 followed by a : that is used in place of the shared root URL. For instance, we might use the prefix iris:Sepal.Length and iris:Sepal.Width where iris: is defined to mean http://example.com/iris# in our example above.

URI vs URL

While I’ve referred to these things as URLs, (uniform resource locator, aka web address) technically they can be a broader class of things known as URIs (uniform resource identifier). In addition to including anything that is a URL, URIs include things which are not URLs, like urn:isbn:0-486-27557-4 or urn:uuid:aac06f69-7ec8-403d-ad84-baa549133dce, which are URNs: unique resource numbers in some numbering scheme (e.g. book ISBN numbers, or UUIDs), neither of which are URLs but nonetheless enjoy the same globally unique property.

Blank nodes

Sometimes we do not need a globally unique identifier, we just want a way to refer to a node (e.g. subject, and sometimes an object) uniquely in our document. This is the role of a blank node (do follow the link for a better overview). These are frequently denoted with the prefix _:, e.g. we could have replaced the sample IDs as _:1, _:2 instead of the URLs such as http://example.com/iris#1 etc. Note that RDF operations need not preserve the actual string pattern in a blank ID name, it means the exact same thing if we replace all the _:1s with _:b1 and _:2 with _:b2, etc.

In librdf we can get a blank node by passing an empty string or character string that is not a URI as the subject. Here we also use a URI that isn’t a URL as predicate:

rdf <- rdf()
rdf %>% rdf_add("",   
                "iris:Sepal.Length", 
                object = 5.1)
rdf

Note that we get a blank node, _: with a randomly generated string.

Triple notation: nquads rdfxml, turtle, and nquads

So far we have relied primarily on a three-column tabular format to represent our triples. We have also seen the default print format used for the rdf method, known as N-Quads above, which displays a bare, space-separated triple, possibly with a datatype URI attached to the object. The line ends with a dot, which indicates this is part of the same local triplestore (aka RDF graph or RDF Model). Technically this could be another URI indicating a unique global address for the triplestore in question.

We can serialize any rdf object out to a file in this format with the rdf_serialize() function, e.g.

rdf_serialize(rdf, "rdf.nq", format = "nquads")

Just as each of these formats can be serialized with rdf_serialize(), each can be read by rdflib using the function rdf_parse():

doc <- system.file("extdata/example.rdf", package="redland")
rdf <- rdf_parse(doc, format = "rdfxml") 
rdf

N-Quads are convenient in that each triple is displayed on a unique line, and the format supports the blank node and Datatype URIs in the manner we have just discussed. Other formats are not so concise. Rather than print to file, we can simply change the default print format used by rdflib to explore the textual layout of the other serializations. Here is one of the most common classical serializations, RDF/XML which expresses triples in an XML-based schema:

options(rdf_print_format = "rdfxml")
rdf

Just looking at this is probably enough to explain why so many alternative serializations were created. Another popular format, turtle, looks more like nquads:

options(rdf_print_format = "turtle")
rdf

Here, blank nodes are denoted by []. turtle uses indentation to indicate that all three predicates (creator, description, title) are properties of the same subject.

JSON-LD

While formats such as nquads and turtle provide a much cleaner syntax than RDF/XML, they also introduce a custom format rather than building on a familiar standard (like XML) for which users already have a well-developed set of tools and intuition. After more than a decade of such challenges (RDF specification started 1997, including an the HTML-embedded serialization of RDFa in 2004), a more user friendly specification has emerged in the form of JSON-LD (1.0 W3C specification was released in 2014, the 1.1 specification released in February 2018). JSON-LD uses the familiar object notation of JSON, (which is rapidly replacing XML as the ubiquitous data exchange format, and will be more familiar to many readers than the specialized RDF formats or even XML. Here is our rdf data in the JSON-LD serialization:

options(rdf_print_format = "jsonld")
rdf

In this serialization, our subject corresponds to “the thing in the curly braces,” (i.e. the JSON “object”) which is identified by the special @id property (omitting @id corresponds to a blank node). The predicate-object pairs in the triple are then just JSON key-value pairs within the curly braces of the given object. We can make this format look even more natural by stripping out the URLs. While it is possible to use prefixes in place of URLs, it is more natural to pull them out entirely, e.g. by declaring a default vocabulary in the JSON-LD “Context”, like so:

rdf_serialize(rdf, "example.json", "jsonld") %>% 
  jsonld_compact(context = '{"@vocab": "http://purl.org/dc/elements/1.1/"}')

The context of a JSON-LD file can also define datatypes, use multiple namespaces, and permit different names in the JSON keys from that found in the URLs. While a complete introduction to JSON-LD is beyond our scope, this representation essentially provides a way to map intuitive JSON structures into precise RDF triples.

From tables to Graphs

So far we have considered examples where the data could be represented in tabular form. We frequently encounter data that cannot be easily represented in such a format. For instance, consider the JSON data in this example:

ex <- system.file("extdata/person.json", package="rdflib")
cat(readLines(ex), sep = "\n")
#jsonld_compact(ex, "{}")

This JSON object for a Person has another JSON object nested inside (a PostalAddress). Yet if we look at this data as nquads, we see the familiar flat triple structure:

options(rdf_print_format = "nquads")
rdf <- rdf_parse(ex, "jsonld")
rdf

So what has happened? Note that our address has been given the blank node URI _:b0, which serves both as the object in the address line of the Person and as the subject of all the properties belonging to the PostalAddress. In JSON-LD, this structure is referred to as being ‘flattened’:

jsonld_flatten(ex, context = "https://schema.org/")

Note that our JSON-LD structure now starts with an object called @graph. Unlike our opening examples, this data is not tabular in nature, but rather, is formatted as a nested graph. Such nesting is very natural in JSON, where objects can be arranged in a tree-like structure with a single outer-most set of {} indicating a root object. A graph is just a more generic form of a tree structure, where we are agnostic to the root. (We could in fact use the @reverse property on address to create a root PostalAddress that contains the Person). In this way, the notion of data as a graph offers a powerful generalization to the notion of tabular data. The @graph above consists of two separate objects: a PostalAddress (with id of _:b0) and a Person (with an ORCID id). This layout acts much like a foreign key in a relational database, or as a list-column in tidyverse (e.g. see tidyr::nest()). rdflib uses this flattened representation when serializing JSON-LD objects. Note that JSON-LD provides a rich set of utilities to go back and forth between flattened and nested layouts using jsonld_frame. For instance, we can recover the original structure just by specifying a frame that indicates which type we want as the root:

jsonld_flatten(ex) %>%
  jsonld_frame('{"@type": "https://schema.org//Person"}') %>%
  jsonld_compact(context = "https://schema.org/")

(Recall that compacting just replaces URIs and any type declarations with short names given by the context). This is somewhat analogous to join operations in relational data, or nesting and un-nesting functions in tidyr. However, when working with RDF, the beautiful thing is that the differences between these two representations (nested or flattened) are purely aesthetic. Both representations have precisely the same semantic meaning, and are thus precisely the same thing in RDF world. We will never have to orchestrate a join on a foreign key before we can perform desired operations like select and filter on the data. We don’t have to think about how our data is organized, because it is always in the same molten triple format, whatever it is, and however nested it might be.

Just as we saw gather could provide a relatively generic way of transforming a data.frame into RDF triples, JSON-LD defines a relatively simple convention for getting nested data (e.g. lists) into RDF triples. This convention simply treats JSON {} objects as subjects (often assigning blank node ids, as we saw with row ids), and key-value pairs (or in R-speak, list names and values) as predicates and objects, respectively. Any raw JSON file can be treated as JSON-LD, ideally by specifying an appropriate context, which serves to map terms into URIs as we saw with data.frames. JSON-LD is then already a valid RDF format that we can parse with rdflib.

For instance, here is a simple function for coercing list objects into RDF with a specified context:

as_rdf.list <- function(x, context = "https://schema.org/"){
  if(length(x) == 1) x <- x[[1]]
  x[["@context"]] <- context
  json <- jsonlite::toJSON(x, pretty = TRUE, auto_unbox = TRUE, force = TRUE)
  rdflib::rdf_parse(json, "jsonld")
}

Here we set a default context (https://schema.org/), and map a few R terms to corresponding schema terms

context <- list("https://schema.org/", 
                list(schema = "https://schema.org//",
                     given = "givenName",
                     family = "familyName",
                     title = "name",
                     year = "datePublished",
                     note = "softwareVersion",
                     comment = "identifier",
                     role = "https://www.loc.gov/marc/relators/relaterm.html"))

We can now apply our function on arbitrary R list objects, such as the bibentry class object returned by the citation() function:

options(rdf_print_format = "nquads") # go back to the default


R <- citation("rdflib")
rdf <- as_rdf.list(R, context)
rdf  

SPARQL: A Graph Query Language

So far, we have spent a lot of words describing how to transform data into RDF, and not much actually doing anything cool with said data.

Still working on writing this section

#source(system.file("examples/as_rdf.R", package="rdflib"))
source(system.file("examples/tidy_schema.R", package="rdflib"))

## Testing: Digest some data.frames into RDF and extract back
 cars <- mtcars %>% rownames_to_column("Model")
 x1 <- as_rdf(iris, NULL, "iris:")
 x2 <- as_rdf(cars, NULL, "mtcars:")
 rdf <- c(x1,x2)

SPARQL: Getting back to Tidy Tables!

sparql <-
  'SELECT  ?Species ?Sepal_Length ?Sepal_Width ?Petal_Length  ?Petal_Width
WHERE {
 ?s <iris:Species>  ?Species .
 ?s <iris:Sepal.Width>  ?Sepal_Width .
 ?s <iris:Sepal.Length>  ?Sepal_Length . 
 ?s <iris:Petal.Length>  ?Petal_Length .
 ?s <iris:Petal.Width>  ?Petal_Width 
}'

iris2 <- rdf_query(rdf, sparql)

We can automatically create the a SPARQL query that returns “tidy data”. Tidy data has predicates as columns, objects as values, subjects as rows.

sparql <- tidy_schema("Species",  "Sepal.Length", "Sepal.Width", prefix = "iris")

rdf_query(rdf, sparql)

  1. “The semantic web is the future of the internet and always will be.” -Peter Norvig, Director of Research at Google↩︎

  2. Couldn’t we just have used another column? Perhaps, but then it wouldn’t be a triple. More to the point, the datatype modifies object alone, not the predicate or subject.↩︎

  3. Technically I believe it should be a NCName, defined by the regexp [\i-[:]][\c-[:]]*. Essentially, this says it cannot include symbol characters like :, @, $, %, &, /, +, ,, ;, whitespace characters or different parenthesis. Furthermore an NCName cannot begin with a number, dot or minus character although they can appear later in an NCName.↩︎