The English version of this specification is the only normative version. Non-normative translations may also be available.

There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations. The Data Cube vocabulary is a core foundation which supports extension vocabularies to enable publication of other aspects of statistical data flows or other multi-dimensional data sets.

This document was published by the Government Linked Data Working Group as a Recommendation. If you wish to make comments regarding this document, please send them to public-gld-comments@w3.org (subscribe, archives). All comments are welcome.

This vocabulary was originally developed and published outside of W3C, but has been extended and further developed within the Government Linked Data Working Group.

This document has been reviewed by W3C Members, by software developers, and by other W3C groups and interested parties, and is endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited from another document. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document describes the Data Cube vocabulary. It is aimed at people wishing to publish statistical or other multi-dimensional data in RDF. The mechanics of translation from other formats, such as SDMX-ML, are not covered here.

A key component of the SDMX standards package is the Content-Oriented Guidelines (COGs), a set of cross-domain concepts, code lists, and categories that support interoperability and comparability between datasets by providing a shared terminology between SDMX implementers [ COG ]. RDF versions of these terms are available separately for use along with the Data Cube vocabulary; see Content oriented guidelines for further details. These external resources do not form a normative part of the Data Cube vocabulary specification.

There have been a number of important results from this work: two versions of a set of technical specifications - ISO:TS 17369 (SDMX) - and the release of several recommendations for structuring and harmonising cross-domain statistics, the SDMX Content-Oriented Guidelines. All of the products are available at www.sdmx.org. The standards are now being widely adopted around the world for the collection, exchange, processing, and dissemination of aggregate statistics by official statistical organisations. The UN Statistical Commission recommended SDMX as the preferred standard for statistics in 2007.

The Statistical Data and Metadata Exchange (SDMX) Initiative was organised in 2001 by seven international organizations (BIS, ECB, Eurostat, IMF, OECD, World Bank and the UN) to realise greater efficiencies in statistical practice. These organisations all collect significant amounts of data, mostly from the national level, to support policy. They also disseminate data at the supra-national and international levels.

There are a number of benefits to being able to publish multi-dimensional data, such as statistics, using RDF and the linked data approach:

Linked data is an approach to publishing data on the web, enabling datasets to be linked together through references to common concepts. The approach [ LOD ] recommends use of HTTP URIs to name the entities and concepts so that consumers of the data can look up those URIs to get more information, including links to other related URIs. RDF [ RDF-PRIMER ] provides a standard for the representation of the information that describes those entities and concepts, and is returned by dereferencing the URIs.

The Data Cube vocabulary is focused purely on the publication of multi-dimensional data on the web. We envisage a series of modular vocabularies being developed which extend this core foundation. In particular, we see the need for an SDMX extension vocabulary to support the publication of additional context to statistical data (such as the encompassing Data Flows and associated Provision Agreements). Other extensions are possible to support metadata for surveys (so-called "micro-data", as encompassed by DDI ) or publication of statistical reference metadata.

At the heart of a statistical dataset is a set of observed values organized along a group of dimensions, together with associated metadata. The Data Cube vocabulary enables such information to be represented using the W3C RDF (Resource Description Framework) standard and published following the principles of linked data . The vocabulary is based upon the approach used by the SDMX ISO standard for statistical data exchange. This cube model is very general and so the Data Cube vocabulary can be used for other data sets such as survey data, spreadsheets and OLAP data cubes [ OLAP ].

Statistical data is a foundation for policy prediction, planning and adjustments and underpins many of the mash-ups and visualisations we see on the web. There is strong interest in being able to publish statistical data in a web-friendly format to enable it to be linked and combined with related information.

The names of RDF entities -- classes, predicates, individuals -- are URIs. These are usually expressed using a compact notation where the name is written prefix:localname, and where the prefix identifies a namespace URI. The namespace identified by the prefix is prepended to the localname to obtain the full URI.
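As a minimal sketch of this expansion, using the qb: and rdfs: prefixes employed throughout this document:

```turtle
# Prefix declarations bind a short prefix to a namespace URI.
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# The compact name qb:Observation is shorthand for the full URI
# <http://purl.org/linked-data/cube#Observation>.
qb:Observation rdfs:label "Observation"@en .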

The key words MUST, MUST NOT, REQUIRED, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [ RFC2119 ].

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

An example of slicing the data would be to define slices in which the time and sex are fixed for each slice. Such slices then show the variation in life expectancy across the different regions, i.e. corresponding to the columns in the above tabular layout.

We can see that there are three dimensions - time period (rolling averages over three year timespans), region and sex. Each observation represents the life expectancy for that population (the measure) and we will need an attribute to define the units (years) of the measured values.

In order to illustrate the use of the Data Cube vocabulary we will use a small demonstration data set extracted from StatsWales report number 003311, which describes life expectancy broken down by region (unitary authority), sex and time. The extract we will use is:

In statistical applications it is common to work with slices in which a single dimension is left unspecified. In particular, slices in which the single free dimension is time are referred to as Time Series, and slices along non-time dimensions as Sections. Within the Data Cube vocabulary we allow slices of arbitrary dimensionality and do not give different names to particular types of slice. Such sub-classes of slice could be added in extension vocabularies.

A data publisher may identify slices through the data for various purposes. They can be a useful grouping to which metadata might be attached, for example to note a change in measurement process which affects a particular time or region. Slices also enable the publisher to identify and label particular subsets of the data which should be presented to the user - they can enable the consuming application to more easily construct the appropriate graph or chart for presentation.

It is frequently useful to group subsets of observations within a dataset. In particular, it is useful to fix all but one (or a small subset) of the dimensions and be able to refer to all observations with those dimension values as a single entity. We call such a selection a slice through the cube. For example, given a data set of regional performance indicators, we might group together all the observations about a given indicator and a given region. Each such group would be a slice representing a time series of observed values.
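A sketch of such a time-series slice using the vocabulary's qb:Slice machinery, applied to the running life-expectancy example (the eg: names, including the example namespace itself, are hypothetical):

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-code:      <http://purl.org/linked-data/sdmx/2009/code#> .
@prefix eg:             <http://example.org/ns#> .

# A slice fixing region and sex; the free dimension is time,
# so the grouped observations form a time series.
eg:slice-newport-male a qb:Slice ;
    qb:sliceStructure  eg:sliceByRegionSex ;  # a qb:SliceKey naming the fixed dimensions
    eg:refArea         eg:newport ;           # fixed dimension value (hypothetical URI)
    sdmx-dimension:sex sdmx-code:sex-M ;      # fixed dimension value, from the COG code list
    qb:observation     eg:obs1, eg:obs2 .     # the observations belonging to this slice
```

Metadata attached to eg:slice-newport-male (for example, a note about a change in measurement process) then applies to the whole group of observations at once.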

The attribute components allow us to qualify and interpret the observed value(s). They enable specification of the units of measure, any scaling factors and metadata such as the status of the observation (e.g. estimated, provisional).

The dimension components serve to identify the observations. A set of values for all the dimension components is sufficient to identify a single observation. Examples of dimensions include the time to which the observation applies, or a geographic region which the observation covers.
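To make this concrete, a single observation might be encoded as follows. This is an illustrative sketch only: it anticipates the example component properties developed later in this document, and the eg: names (including the example namespace itself) are hypothetical:

```turtle
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-code:      <http://purl.org/linked-data/sdmx/2009/code#> .
@prefix eg:             <http://example.org/ns#> .

# One observation: the life expectancy for one region, period and sex.
# The three dimension values together uniquely identify the observation.
eg:obs1 a qb:Observation ;
    qb:dataSet          eg:dataset-le ;        # the containing data set
    eg:refArea          eg:newport ;           # dimension: region (hypothetical URI)
    eg:refPeriod        eg:period-2004-2006 ;  # dimension: time period (hypothetical URI)
    sdmx-dimension:sex  sdmx-code:sex-F ;      # dimension: sex, from the COG code list
    eg:lifeExpectancy   83.7 .                 # the measured value
```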

A statistical data set comprises a collection of observations made at some points across some logical space. The collection can be characterized by a set of dimensions that define what the observation applies to (e.g. time, area, gender) along with metadata describing what has been measured (e.g. economic activity, population), how it was measured and how the observations are expressed (e.g. units, multipliers, status). We can think of the statistical data set as a multi-dimensional space, or hyper-cube, indexed by those dimensions. This space is commonly referred to as a cube for short; though the name shouldn't be taken literally, it is not meant to imply that there are exactly three dimensions (there can be more or fewer) nor that all the dimensions are somehow similar in size.

A DataSet is a collection of statistical data that corresponds to a defined structure. The data in a data set can be roughly described as belonging to one of the following kinds:

6. Creating data structure definitions

A qb:DataStructureDefinition defines the structure of one or more datasets. In particular, it defines the dimensions, attributes and measures used in the dataset along with qualifying information such as ordering of dimensions and whether attributes are required or optional. For well-formed data sets much of this information is implicit within the RDF component properties found on the observations. However, the explicit declaration of the structure has several benefits:

it enables verification that the data set matches the expected structure, in particular helps with detection of incoherent sets obtained by combining differently structured source data;

it allows a consumer to easily determine what dimensions are available for query and their presentational order, which in turn simplifies data consumption, for example for UI construction;

it supports transmission of the structure information in associated SDMX data flows (see below).
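The second benefit above - discovering the available dimensions and their presentational order - can be sketched as a SPARQL query against the example structure definition eg:dsd-le used in this document (the eg: example namespace is assumed):

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX eg: <http://example.org/ns#>

# List the dimension properties declared by the DSD, in presentational order.
SELECT ?dimension ?order WHERE {
    eg:dsd-le qb:component ?spec .
    ?spec qb:dimension ?dimension .
    OPTIONAL { ?spec qb:order ?order }
}
ORDER BY ?order
```

A consuming application can run such a query once per data set and use the result to drive UI construction, without inspecting the observations themselves.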

It is common, when publishing statistical data, to have a regular series of publications which all follow the same structure. The notion of a Data Structure Definition (DSD) allows us to define that structure once and then reuse it for each publication in the series. Consumers can then be confident that the structure of the data has not changed.

6.1 Dimensions, attributes and measures

The Data Cube vocabulary represents the dimensions, attributes and measures as RDF properties. Each is an instance of the abstract qb:ComponentProperty class, which in turn has sub-classes qb:DimensionProperty, qb:AttributeProperty and qb:MeasureProperty. A component property encapsulates several pieces of information:

the concept being represented (e.g. time or geographic area),

the nature of the component (dimension, attribute or measure) as represented by the type of the component property,

the type or code list used to represent the value.

The same concept can be manifested in different components. For example, the concept of currency may be used as a dimension (in a data set dealing with exchange rates) or as an attribute (when describing the currency in which an observed trade took place). The concept of time is typically used only as a dimension but may be encoded as a data value (e.g. an xsd:dateTime) or as a symbolic value (e.g. a URI drawn from the reference time URI set developed by data.gov.uk).

In statistical agencies it is common to have a standard thesaurus of statistical concepts which underpins the components used in multiple different data sets. To support this reuse of general statistical concepts the Data Cube vocabulary provides the qb:concept property, which links a qb:ComponentProperty to the concept it represents. We use the SKOS vocabulary [ SKOS-PRIMER ] to represent such concepts. This is very natural for those cases where the concepts are already maintained as a controlled term list or thesaurus. When developing a data structure definition for an informal data set there may not be an appropriate concept already. In those cases, if the concept is likely to be reused in other guises, it is recommended to publish a skos:Concept along with the specific qb:ComponentProperty. However, if such reuse is not expected then it is not required to do so - the qb:concept link is optional and a simple instance of the appropriate subclass of qb:ComponentProperty is sufficient.

The representation of the possible values of the component is described using the rdfs:range property of the component in the usual RDF manner. Thus, for example, values of a time dimension might be represented using literals of type xsd:dateTime or as URIs drawn from a time reference service.

In statistical data sets it is common for values to be encoded using some (possibly hierarchical) code list, and it can be useful to be able to easily identify the overall code list in some more structured form. To cater for this a component can also optionally be annotated with a qb:codeList to indicate a set of skos:Concepts which may be used as codes. The qb:codeList value may be a skos:ConceptScheme, skos:Collection or qb:HierarchicalCodeList. In such a case the rdfs:range of the component might be left as simply skos:Concept, but a useful design pattern is to also define an rdfs:Class whose members are all the skos:Concepts within a particular scheme. In that way the rdfs:range can be made more specific, which enables generic RDF tools to perform appropriate range checking.

Note that in any SDMX extension vocabulary there would be one further item of information to encode about components - the role that they play within the structure definition. In particular, it is sometimes convenient for consumers to be able to easily identify which is the time dimension, which component is the primary measure and so forth. It turns out that such roles are intrinsic to the concepts, and so this information can be encoded by providing subclasses of skos:Concept for each role. The particular choice of roles here is specific to the SDMX standard and so is not included within the core Data Cube vocabulary.

Before illustrating the components needed for our running example, there is one more piece of machinery to introduce: a reusable set of concepts and components based on SDMX.
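The code-list design pattern described above can be sketched in Turtle with a hypothetical "industry" dimension (all eg: names, including the example namespace itself, are invented for illustration):

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix eg:   <http://example.org/ns#> .

# A small code list, plus a class covering exactly its codes.
eg:industryScheme a skos:ConceptScheme ;
    skos:prefLabel "Industries"@en .
eg:Industry a rdfs:Class ;
    rdfs:subClassOf skos:Concept .
eg:manufacturing a skos:Concept, eg:Industry ;
    skos:inScheme  eg:industryScheme ;
    skos:prefLabel "Manufacturing"@en .

# The dimension declares both the code list and the tighter range,
# so generic RDF tools can perform appropriate range checking.
eg:industry a qb:DimensionProperty ;
    qb:codeList eg:industryScheme ;
    rdfs:range  eg:Industry .
```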

6.2 Content oriented guidelines

This section is non-normative.

The SDMX standard includes a set of content oriented guidelines (COG) [ COG ] which define a set of common statistical concepts and associated code lists that are intended to be reusable across data sets. A community group has developed RDF encodings of these guidelines. These comprise:

sdmx-concept    http://purl.org/linked-data/sdmx/2009/concept#    SKOS Concepts for each COG-defined concept
sdmx-code       http://purl.org/linked-data/sdmx/2009/code#       SKOS Concepts and ConceptSchemes for each COG-defined code list
sdmx-dimension  http://purl.org/linked-data/sdmx/2009/dimension#  Component properties corresponding to each COG concept that can be used as a dimension
sdmx-attribute  http://purl.org/linked-data/sdmx/2009/attribute#  Component properties corresponding to each COG concept that can be used as an attribute
sdmx-measure    http://purl.org/linked-data/sdmx/2009/measure#    Component properties corresponding to each COG concept that can be used as a measure

These community resources are provided as a convenience and do not form part of the Data Cube specification. However, they are used by a number of existing Data Cube publications and so we will reference them within our worked examples.

6.3 Example dimensions and measure

This section is non-normative.

Turning to our example data set, we can see there are three dimensions to represent - time period, region (unitary authority) and sex. There is a single (primary) measure which corresponds to the topic of the data set (life expectancy) and encodes a value in years. Hence, we need the following components.

Time. There is a suitable predefined concept in the SDMX-COG for this, REF_PERIOD, so we could reuse the corresponding component property sdmx-dimension:refPeriod. However, to represent the time period itself it would be convenient to use the data.gov.uk reference time service and to declare this within the data structure definition.

Example 1

eg:refPeriod a rdf:Property, qb:DimensionProperty;
    rdfs:label "reference period"@en;
    rdfs:subPropertyOf sdmx-dimension:refPeriod;
    rdfs:range interval:Interval;
    qb:concept sdmx-concept:refPeriod .

Region. Again there is a suitable COG concept and associated component that we can use for this, and again we can customize the range of the component. In this case we can use the Ordnance Survey Administrative Geography Ontology [ OS-GEO ].

Example 2

eg:refArea a rdf:Property, qb:DimensionProperty;
    rdfs:label "reference area"@en;
    rdfs:subPropertyOf sdmx-dimension:refArea;
    rdfs:range admingeo:UnitaryAuthority;
    qb:concept sdmx-concept:refArea .

Sex. In this case we can use the corresponding COG component sdmx-dimension:sex directly, since the default code list for it includes the terms we need.

Measure. This property will give the value of each observation. We could use the default sdmx-measure:obsValue for this (defining the topic being observed using metadata). However, it can aid readability and processing of the RDF data sets to use a specific measure corresponding to the phenomenon being observed.

Example 3

eg:lifeExpectancy a rdf:Property, qb:MeasureProperty;
    rdfs:label "life expectancy"@en;
    rdfs:subPropertyOf sdmx-measure:obsValue;
    rdfs:range xsd:decimal .
Unit measure attribute. The primary measure on its own is a plain decimal value. To correctly interpret this value we need to define what units it is measured in (years in this case). This is defined using attributes which qualify the interpretation of the observed value. Specifically, in this example we can use the predefined sdmx-attribute:unitMeasure, which in turn corresponds to the COG concept of UNIT_MEASURE. To express the value of this attribute we would typically use a common thesaurus of units of measure. For the sake of this simple example we will use the DBpedia resource http://dbpedia.org/resource/Year which corresponds to the topic of the Wikipedia page on "Years".

This covers the minimal components needed to define the structure of this data set.
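When the attribute is attached at the data-set level rather than to each observation, stating the unit is then a single triple. A minimal sketch, assuming a hypothetical data set URI eg:dataset-le in an invented eg: example namespace:

```turtle
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix eg:             <http://example.org/ns#> .

# All observed values in eg:dataset-le are measured in years.
eg:dataset-le sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Year> .
```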

6.4 ComponentSpecifications and DataStructureDefinitions

To combine the components into a specification for the structure of this dataset we need to declare a qb:DataStructureDefinition resource which in turn will reference a set of qb:ComponentSpecification resources. The qb:DataStructureDefinition will be reusable across other data sets with the same structure. In the simplest case the qb:ComponentSpecification simply references the corresponding qb:ComponentProperty (usually using one of the sub-properties qb:dimension, qb:measure or qb:attribute). However, it is also possible to qualify the component specification in several ways.

Attributes may be declared as optional or required. If an attribute is required to be present for every observation then the specification should set qb:componentRequired. In the absence of such a declaration an attribute is assumed to be optional. The qb:componentRequired declaration may only be applied to component specifications of attributes - measures and dimensions are always required.

The components may be ordered by giving an integer value for qb:order. This order carries no semantics but can be useful to aid consuming agents in generating appropriate user interfaces. It can also be useful in the publication chain to enable synthesis of appropriate URIs for observations.

By default the values of all of the components will be attached to each individual observation; this is called the normalized representation. This allows such observations to stand alone, so that a SPARQL query to retrieve an observation can immediately locate the attributes which enable the observation to be interpreted. However, it is also permissible to attach attributes to the overall data set, to an intervening slice or to a specific measure (in the case of multiple measures). This reduces some of the redundancy in the encoding of the instance data. To declare such an abbreviated structure, the qb:componentAttachment property of the specification should reference the class corresponding to the attachment level (e.g. qb:DataSet for attributes that will be attached to the overall data set). The classes which can be used as such attachment levels are all subclasses of qb:Attachable.

In the case of our running example the dimensions can be usefully ordered. There is only one attribute, the unit measure, and this is required. In the interest of illustrating the vocabulary use we will declare that this attribute will be attached at the level of the data set; however, normalized representations are in general easier to query and combine.
So the structure of our example data set (and other similar datasets) can be declared by:

Example 4

eg:dsd-le a qb:DataStructureDefinition;
    # The dimensions
    qb:component [ qb:dimension eg:refArea;         qb:order 1 ];
    qb:component [ qb:dimension eg:refPeriod;       qb:order 2 ];
    qb:component [ qb:dimension sdmx-dimension:sex; qb:order 3 ];
    # The measure(s)
    qb:component [ qb:measure eg:lifeExpectancy ];
    # The attributes
    qb:component [ qb:attribute sdmx-attribute:unitMeasure;
                   qb:componentRequired "true"^^xsd:boolean;
                   qb:componentAttachment qb:DataSet; ] .

Note that we have given the data structure definition (DSD) a URI since it will be reused across different datasets with the same structure. Similarly the component properties themselves can be reused across different DSDs. However, the component specifications are only useful within the scope of a particular DSD, and so we have chosen to represent them using blank nodes.
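A data set then links to this structure via qb:structure. A minimal sketch, assuming a hypothetical data set URI eg:dataset-le and an illustrative label (the eg: example namespace is invented):

```turtle
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix eg:   <http://example.org/ns#> .

# A data set declaring that it conforms to the eg:dsd-le structure.
eg:dataset-le a qb:DataSet ;
    rdfs:label   "Life expectancy"@en ;
    qb:structure eg:dsd-le .
```

Consumers can dereference eg:dsd-le from this single triple to discover the dimensions, measures and attributes they should expect on the observations.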