Abstract

Good descriptions of data are essential to finding, understanding and ultimately reusing data. However, data is all too often published without adequate descriptions of its provenance.

Here, we describe nanopublications, a community-driven approach to representing structured data along with its provenance as a single publishable and citable entity. A nanopublication minimally consists of an assertion, the provenance of the assertion, and the provenance of the nanopublication. The nanopublication is represented with, and may be queried using, Semantic Web technologies (RDF, OWL, SPARQL). Nanopublications also feature a mechanism to ensure the integrity of data and its provenance. Access to provenance enables users to assess the trustworthiness of data and provides a mechanism by which authors and institutions may be acknowledged for their contribution to the global knowledge graph. Nanopublications may be used to expose quantitative and qualitative data, as well as hypotheses, claims, and negative results that usually go unpublished. With nanopublications, it is possible to disseminate individual data as independent publications with or without an accompanying research article.

This document describes the structure of nanopublications and offers guidelines for their composition, implementation and use. It was produced by members of the Concept Web Alliance (CWA), an open collaborative community that is actively addressing the challenges associated with the production, management, interoperability and analysis of unprecedented volumes of data. We collaborate via GitHub.

Primer

A nanopublication combines an assertion, the provenance of the assertion, and the provenance of the nanopublication into a single publishable and citable entity. The assertion and both kinds of provenance are represented as RDF graphs.

The assertion graph of the nanopublication contains an assertion that consists of one or more RDF triples (subject-predicate-object tuples). Examples of assertions include:

:assertion {
    :trastuzumab :is-indicated-for :breast-cancer .
}
:assertion {
    :BRCA1-gene :is-involved-in :breast-cancer .
    :BRCA1-gene :encodes :BRCA1-protein .
    :BRCA1-protein :is-expressed-in :breast .
}

Paul: Michel suggests adding the following to explain assertion "An assertion is a proposition that is falsifiable, that is to say we can test whether the proposition is true or false." - this is under discussion.

The provenance graph of the nanopublication contains one or more RDF triples that provide information about the assertion. A nanopublication MUST have a provenance graph identifier linked to the assertion graph identifier. Provenance means ‘how this came to be’, and may include any statement that describes how the assertion was generated, who generated it, when it was generated, where it was obtained from, and any other similar information. Examples of assertion provenance include:

:provenance {
    :assertion prov:generatedAtTime "2012-02-03T14:38:00Z"^^xsd:dateTime  .
    :assertion prov:wasDerivedFrom :experiment . 
    :assertion prov:wasAttributedTo :experimentScientist .
}

The publicationInfo graph contains one or more RDF triples that offer provenance information regarding the nanopublication itself. The subject of these triples MUST be the nanopublication URI, and the graph SHOULD contain an attribution and a timestamp. Examples of nanopublication provenance include:

:pubInfo {
 :nanopubEx prov:wasAttributedTo :paul .
 :nanopubEx prov:generatedAtTime "2012-10-26T12:45:00Z"^^xsd:dateTime .
}

The nanopublication itself receives its own identifier and refers to its parts.

:nanopubEx np:hasAssertion :assertion .
:nanopubEx np:hasProvenance :provenance .
:nanopubEx np:hasPublicationInfo :pubInfo .
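
The way the four graphs fit together can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Quad` type and the `build_nanopub` and `to_nquads` helpers are hypothetical names introduced here, the head graph reuses the nanopublication URI as its label (as in the examples in this document), and literal escaping is not handled.

```python
from typing import NamedTuple

EX = "http://www.example.org/pubs#"
NP = "http://www.nanopub.org/nschema#"
PROV = "http://www.w3.org/ns/prov#"


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str    # an IRI, or a literal already written in N-Quads form
    graph: str


def build_nanopub(nanopub, assertion, provenance, pubinfo):
    """Return the quads of one nanopublication.

    Every argument after `nanopub` is a (graph IRI, list of triples) pair.
    """
    head = [
        (nanopub, NP + "hasAssertion", assertion[0]),
        (nanopub, NP + "hasProvenance", provenance[0]),
        (nanopub, NP + "hasPublicationInfo", pubinfo[0]),
    ]
    # The head triples live in a graph labelled with the nanopublication URI.
    quads = [Quad(s, p, o, nanopub) for s, p, o in head]
    for graph, triples in (assertion, provenance, pubinfo):
        quads.extend(Quad(s, p, o, graph) for s, p, o in triples)
    return quads


def term(value):
    """Wrap IRIs in angle brackets; pass pre-formatted literals through."""
    return value if value.startswith('"') else f"<{value}>"


def to_nquads(quads):
    return "\n".join(
        f"{term(q.subject)} {term(q.predicate)} {term(q.obj)} {term(q.graph)} ."
        for q in quads
    )


quads = build_nanopub(
    EX + "nanopubEx",
    (EX + "assertion",
     [(EX + "trastuzumab", EX + "is-indicated-for", EX + "breast-cancer")]),
    (EX + "provenance",
     [(EX + "assertion", PROV + "wasDerivedFrom", EX + "experiment")]),
    (EX + "pubInfo",
     [(EX + "nanopubEx", PROV + "wasAttributedTo", EX + "paul")]),
)
print(to_nquads(quads))
```

Keeping the provenance triples in their own graph is what allows statements about `:assertion` to be made without becoming part of the assertion itself.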

Nanopublication Ontology

The structure of a nanopublication is defined by the following ontology using the Web Ontology Language (OWL). The ontology namespace is http://www.nanopub.org/nschema. Our namespace policy is that the current version of the ontology is always available at this URL. We suggest using TriG syntax for writing nanopublications.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix np: <http://www.nanopub.org/nschema#>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

<http://www.nanopub.org/nschema> a owl:Ontology ;
	dc:created "2012-10-26"^^xsd:date;
	dc:modified "2013-10-23"^^xsd:date;
	dc:description <http://www.nanopub.org/2013/WD-guidelines-20131215/> ;
	owl:priorVersion <http://www.nanopub.org/nschema-1.9> .

## A nanopublication should be associated with at most one assertion,
## provenance of assertion, and provenance of the nanopublication (publicationInfo).
## We use rdfg:Graph to denote a graph as used in the RDF 1.1 TriG syntax.
## Each subclass of rdfg:Graph should be a separate named graph.

np:Nanopublication rdf:type owl:Class.
np:Assertion rdfs:subClassOf rdfg:Graph.
np:Provenance rdfs:subClassOf rdfg:Graph.
np:PublicationInfo rdfs:subClassOf rdfg:Graph.

np:hasAssertion a owl:FunctionalProperty.
np:hasAssertion rdfs:domain np:Nanopublication.
np:hasAssertion rdfs:range np:Assertion.

np:hasProvenance a owl:FunctionalProperty.
np:hasProvenance rdfs:domain np:Nanopublication.
np:hasProvenance rdfs:range np:Provenance.

np:hasPublicationInfo a owl:FunctionalProperty.
np:hasPublicationInfo rdfs:domain np:Nanopublication.
np:hasPublicationInfo rdfs:range np:PublicationInfo. 
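
The "at most one" intent behind the three functional properties can be checked mechanically. The following Python sketch is an assumption-laden illustration, not part of the ontology: `cardinality_violations` is a hypothetical helper, and it treats the functional properties as a closed-world constraint. Under OWL semantics proper, two values for a functional property are not an error; a reasoner would instead infer that the two values denote the same resource.

```python
from collections import defaultdict

NP = "http://www.nanopub.org/nschema#"
PART_PROPERTIES = (
    NP + "hasAssertion",
    NP + "hasProvenance",
    NP + "hasPublicationInfo",
)


def cardinality_violations(triples):
    """Map (nanopublication, property) to its values wherever a
    nanopublication has more than one distinct value for a part property."""
    values = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in PART_PROPERTIES:
            values[(subject, predicate)].add(obj)
    return {key: sorted(objs) for key, objs in values.items() if len(objs) > 1}


triples = [
    ("ex:np", NP + "hasAssertion", "ex:assertion"),
    ("ex:np", NP + "hasProvenance", "ex:provenance"),
    ("ex:np", NP + "hasAssertion", "ex:otherAssertion"),  # second assertion
]
print(cardinality_violations(triples))
```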

Well-formed Nanopublications

Tobias: These criteria are taken from nanopub-java, and are up for discussion.

The current ontology is loose in its specification to encourage adoption. We thus define an additional set of criteria to further constrain the usage of the ontology. These criteria may be adopted into future versions of the ontology.

A nanopublication MUST comply with all of the following criteria to be considered well-formed:

@prefix : <http://www.example.org/pubs#> .
@prefix np:  <http://www.nanopub.org/nschema#> .
@prefix prov: <http://www.w3.org/ns/prov#> . 
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

:nanopubEx {
     :nanopubEx a np:Nanopublication .
     :nanopubEx np:hasAssertion :assertion .
     :nanopubEx np:hasProvenance :provenance .
     :nanopubEx np:hasPublicationInfo :pubInfo .
}

:assertion {
    :trastuzumab :is-indicated-for :breast-cancer .
    :assertion a np:Assertion .
}

:provenance {
    :assertion prov:generatedAtTime "2012-02-03T14:38:00Z"^^xsd:dateTime  .
    :assertion prov:wasDerivedFrom :experiment . 
    :assertion prov:wasAttributedTo :experimentScientist .
    :provenance a np:Provenance .
}

:pubInfo {
 :nanopubEx prov:wasAttributedTo :paul .
 :nanopubEx prov:generatedAtTime "2012-10-26T12:45:00Z"^^xsd:dateTime .
 :pubInfo a np:PublicationInfo .
}


# The same example as above, in N-Quads
<http://www.example.org/pubs#nanopubEx> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.nanopub.org/nschema#Nanopublication> <http://www.example.org/pubs#nanopubEx> .
<http://www.example.org/pubs#nanopubEx> <http://www.nanopub.org/nschema#hasAssertion> <http://www.example.org/pubs#assertion> <http://www.example.org/pubs#nanopubEx> .
<http://www.example.org/pubs#nanopubEx> <http://www.nanopub.org/nschema#hasProvenance> <http://www.example.org/pubs#provenance> <http://www.example.org/pubs#nanopubEx> .
<http://www.example.org/pubs#nanopubEx> <http://www.nanopub.org/nschema#hasPublicationInfo> <http://www.example.org/pubs#pubInfo> <http://www.example.org/pubs#nanopubEx> .
<http://www.example.org/pubs#trastuzumab> <http://www.example.org/pubs#is-indicated-for> <http://www.example.org/pubs#breast-cancer> <http://www.example.org/pubs#assertion> .
<http://www.example.org/pubs#assertion> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.nanopub.org/nschema#Assertion> <http://www.example.org/pubs#assertion> .
<http://www.example.org/pubs#assertion> <http://www.w3.org/ns/prov#generatedAtTime> "2012-02-03T14:38:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <http://www.example.org/pubs#provenance> .
<http://www.example.org/pubs#assertion> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://www.example.org/pubs#experiment> <http://www.example.org/pubs#provenance> .
<http://www.example.org/pubs#assertion> <http://www.w3.org/ns/prov#wasAttributedTo> <http://www.example.org/pubs#experimentScientist> <http://www.example.org/pubs#provenance> .
<http://www.example.org/pubs#provenance> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.nanopub.org/nschema#Provenance> <http://www.example.org/pubs#provenance> .
<http://www.example.org/pubs#nanopubEx> <http://www.w3.org/ns/prov#wasAttributedTo> <http://www.example.org/pubs#paul> <http://www.example.org/pubs#pubInfo> .
<http://www.example.org/pubs#nanopubEx> <http://www.w3.org/ns/prov#generatedAtTime> "2012-10-26T12:45:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <http://www.example.org/pubs#pubInfo> .
<http://www.example.org/pubs#pubInfo> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.nanopub.org/nschema#PublicationInfo> <http://www.example.org/pubs#pubInfo> .
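
A rough structural check over such N-Quads can be written with the Python standard library alone. The sketch below is one possible reading of the requirements stated in the Primer, not the nanopub-java criteria: `parse_nquads` and `check_nanopub` are hypothetical helpers, the parser only understands the simple line shape used in the listing above (no blank nodes), and the publicationInfo check only asks for at least one triple about the nanopublication URI, since the example also types the graph itself inside that graph.

```python
import re

NP = "http://www.nanopub.org/nschema#"

# Matches only: IRI subject, IRI predicate, IRI or literal object, IRI graph.
QUAD_RE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+'
    r'(<[^>]+>|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]+>|@[\w-]+)?)'
    r'\s+<([^>]+)>\s*\.\s*$'
)


def parse_nquads(text):
    """Parse simple N-Quads lines into (subject, predicate, object, graph)."""
    quads = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = QUAD_RE.match(line)
        if match is None:
            raise ValueError(f"unsupported N-Quads line: {line!r}")
        s, p, o, g = match.groups()
        if o.startswith("<"):
            o = o[1:-1]  # strip brackets from IRIs; literals stay verbatim
        quads.append((s, p, o, g))
    return quads


def check_nanopub(quads):
    """Return a list of problems; an empty list means every check passed."""
    problems, parts, nanopubs = [], {}, set()
    for name in ("hasAssertion", "hasProvenance", "hasPublicationInfo"):
        found = [(s, o) for s, p, o, _ in quads if p == NP + name]
        if len(found) != 1:
            problems.append(f"expected one np:{name}, found {len(found)}")
        else:
            nanopubs.add(found[0][0])
            parts[name] = found[0][1]
    if problems:
        return problems
    if len(nanopubs) != 1:
        return ["part links do not share one nanopublication URI"]
    nanopub = nanopubs.pop()
    if len(set(parts.values())) != 3:
        problems.append("the three parts must be distinct graphs")
    assertion = parts["hasAssertion"]
    if not any(g == parts["hasProvenance"] and assertion in (s, o)
               for s, _, o, g in quads):
        problems.append("provenance graph never refers to the assertion graph")
    if not any(g == parts["hasPublicationInfo"] and s == nanopub
               for s, _, _, g in quads):
        problems.append("publicationInfo graph never describes the nanopublication")
    return problems
```

Running `check_nanopub(parse_nquads(text))` on the listing above should yield an empty list under these assumptions; a full validator would need a complete N-Quads parser.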

Integrity Key

Paul: Integrity keys are being actively defined and will appear in the next specification.

The goal of the integrity key is to establish an identity for a nanopublication that can be used to check whether the nanopublication has changed, thus preserving its immutability. Indeed, nanopublications SHOULD be treated as immutable.

Tobias: I think integrity keys should be part of the nanopub URI. In this way, every reference to a nanopub comes with the possibility to check its integrity. If one nanopub cites another, the integrity key of the cited nanopub flows into the integrity key for the citing one. Therefore, the integrity key ensures not just the integrity of the respective nanopub, but also of all cited nanopubs (and nanopubs cited by the cited nanopubs, etc.). This is inspired by how the Git version control system uses hash values to identify commits. I started to implement this as a proof-of-concept. See the link below.

Michel: Agree with Tobias - it's a major advance to include integrity checks.
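
The idea of a key that is part of the URI and that propagates through citations can be illustrated with a small Python sketch. This is not the integrity key scheme of this specification, which is still being defined: the functions `integrity_key`, `keyed_uri` and `verify`, the use of SHA-256, the line-sorting canonicalisation and the `base.key` URI layout are all assumptions made for the illustration. The sketch also sidesteps the case where a nanopublication's own keyed URI appears inside its content.

```python
import hashlib


def integrity_key(nquad_lines):
    """SHA-256 over the sorted, de-duplicated statement lines."""
    canonical = "\n".join(sorted(set(nquad_lines))) + "\n"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def keyed_uri(base_uri, nquad_lines):
    """Append the integrity key to a base URI (hypothetical layout)."""
    return f"{base_uri}.{integrity_key(nquad_lines)}"


def verify(uri, nquad_lines):
    """Check that the key carried by the URI matches the content."""
    return uri.rsplit(".", 1)[-1] == integrity_key(nquad_lines)


# A cited nanopublication, reduced to a single statement for brevity.
cited = ["<ex:trastuzumab> <ex:is-indicated-for> <ex:breast-cancer> <ex:a1> ."]
cited_uri = keyed_uri("http://www.example.org/pubs#np1", cited)

# The citing nanopublication mentions the keyed URI of the cited one,
# so the cited key becomes part of the content hashed for the citing key.
citing = [
    f"<ex:a2> <http://www.w3.org/ns/prov#wasDerivedFrom> <{cited_uri}> <ex:prov2> ."
]
citing_uri = keyed_uri("http://www.example.org/pubs#np2", citing)

print(verify(cited_uri, cited), verify(citing_uri, citing))
```

Because the cited URI is hashed as part of the citing content, any change to the cited nanopublication yields a different cited URI and therefore a different citing key, much as a Git commit hash covers its parent hashes.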

Nanopublication Compliant RDF for Large Data Sets

To convert large existing datasets into nanopublications, we recommend using the Vocabulary of Interlinked Datasets (VoID, http://www.w3.org/TR/void/) to create a nanopublication-compliant RDF (ncRDF) description of the data. In this way we make each entry in the dataset (e.g. a data row or sample value) referenceable, which in turn makes it possible to specify that a particular assertion was derived from a specific row of the original dataset.

ncRDF is an intermediate step between the dataset as it is, and nanopublications. Rather than creating an extensive domain model, the RDF dataset uses a simple descriptive model. In practice this means that all data items from the original dataset are transformed (no pre-selection) and that all values remain the same (no normalization or rounding). Furthermore, simple ad-hoc ontologies can be used to provide entity and predicate descriptions: full semantic modeling is only required for the nanopublication itself. See figure.

An example of the use of ncRDF in exposing a large dataset is given in the FANTOM5 nanopublication template. Here, each row of the raw dataset is transformed to a ‘CagePeak resource’. Using the void:inDataset predicate, each CagePeak is linked back to the resource for the entire dataset. Subsequent predicates connect the CagePeak to entities that represent columns of the raw dataset.
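
The row-by-row transformation can be sketched as follows in Python. This is a minimal illustration, not the FANTOM5 template: the function `rows_to_ncrdf`, the `row/` and `column/` URI patterns and the sample column names are hypothetical, and the input is assumed to be a rectangular CSV with a header line. Only the void:inDataset predicate comes from VoID. Every cell is emitted and every value is kept verbatim, in line with the "no pre-selection, no normalization" principle above.

```python
import csv
import io
from urllib.parse import quote

VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"


def literal(value):
    """Write a plain string literal, escaping backslashes, quotes, newlines."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n"))
    return f'"{escaped}"'


def rows_to_ncrdf(csv_text, dataset_uri, base_uri):
    """Yield one void:inDataset triple per row and one triple per cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(reader, start=1):
        # Each row becomes its own referenceable resource.
        row_uri = f"{base_uri}row/{number}"
        yield (row_uri, VOID_IN_DATASET, dataset_uri)
        for column, value in row.items():
            # Simple ad-hoc predicate derived from the column header.
            predicate = f"{base_uri}column/{quote(column, safe='')}"
            yield (row_uri, predicate, literal(value))


sample = "peak,tpm\np1,0.50\np2,12.0\n"
for triple in rows_to_ncrdf(sample, "http://example.org/dataset",
                            "http://example.org/"):
    print(triple)
```

Because each row has its own URI, a later nanopublication can state in its provenance graph that an assertion was derived from that specific row.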

Data to Nanopublication

Best & Good Practices

Paul: Future versions may not include a best practice guide and may instead point to regularly updated pages. We include these here to help users get started and as a reminder to update them.

Further Information

Please check http://www.nanopub.org for further guides and community information.

References