2 Definition of Datacube

A set of (n + 1)-tuples (d1,d2,,dn,m) with d1 D1,d2 D2,,dn Dn,m M is called an n-dimensional datacube. D1,D2,,Dn are finite sets of members and are the dimensions of the cube. M is also a set of values (normally numerical) called the measurement.

A dimension is a structural attribute of the cube, that is, a list of members all of which are of a similar type in the user’s perception of the data. For example, all months, quarters, years, etc., make up a time dimension; likewise all cities, regions, countries, etc., make up a geography dimension. A dimension acts as an index for identifying values within a multi-dimensional array. If one member of the dimension is selected, then the remaining dimensions in which a range of members (or all members) are selected define a sub-cube. If all but two dimensions have a single member selected, the remaining two dimensions define a spreadsheet (or a "slice" or a "page"). If all dimensions have a single member selected, then a single cell is defined. A cell can also be identified as a single tuple of the datacube. Dimensions offer a very concise, intuitive way of organizing and selecting data for retrieval, exploration and analysis. A member is a discrete name or identifier used to identify a data item’s position and description within a dimension. For example, January, 1989 or 1Qtr93 are typical examples of members of a time dimension. Wholesale, Retail, etc., are typical examples of members of a distribution channel dimension.

Figure 1: A 3-dimensional cube

As an example see the Figure 1. It represents a 3-dimensional cube with three dimensions time, region, and product. The measurement is Sale. The members of time are {1996, 1997, 1998}, the members of region are {CEE,USA}, and the members of product are {P1,P2,P3}. The cell (1997,CEE,P2) contains the value 500. The interpretation is that 500 units of the product P2 has been sold in 1997 in CEE. The cube contains 3 × 3 × 2 = 18 cells.

A member combination is an exact description of a unique cell in the datacube which contains a single value (the measurement). A datacube can also be seen as a multi-dimensional array. A cell – a unique tuple – can be seen as a single data point that occurs at the intersection defined by selecting one member from each dimension in a multi-dimensional array. The maximal number of cells is given by the cardinalities of the dimensions as: |D1|⋅|D2|⋅ ⋅|Dn|. The tuple-set is also called the Cartesian Product of the dimensions. A datacube can be dense or sparse. It is called dense if a relatively high percentage of the possible combinations of its dimension members contain data values, otherwise it is called sparse. It is important to see, that very sparse datacubes are very common in practice.

From a database point of view, an n-dimensional datacube is typically stored as a database table containing n + 1 fields, the first n fields representing the dimensions and the (n + 1)-th field represents the measurement (the data value). The first n fields are typically (but not necessary) foreign keys pointing to a table filled with basis “identifier” lists. It is important to note that a table is a more general concept than a datacube: (1) The same tuple of members in a datacube – defining a cell – can occur only once in a datacube, while it can occur several times in a database table, except when a primary key is defined on the “dimensional” fields. In praxis, however, this is not a limitation, because we mostly analyze data which can be classified according some dimensions and hence have distinct tuples. (2) A database table – besides the "dimensional" fields – can contain several “measurements” fields. If this is the case, then several datacubes with the same dimensions can be built, or a datacube with an additional dimension, which contains a “measurement” field name as members, if the measurements are all of the same data type (numeric, alphanumeric, etc.). If this is not the case the db table can be mapped to a set of datacubes.

We call the list of tuples defining a datacube – normally printed in a vertical way – on a piece of paper, the standard view or the db-view of the datacube. As we shall see, this is nothing else than a particular pivot table view.

In a mathematical notation, a cube can be represented as follows:

mi1,i2,...,in   forall i1 ∈ D1, i2 ∈ D2,...,in ∈ Dn, m ∈ M

i1,i2,,in are called indexes, and the notation is called the indexed notation (see [9]).