Release Notes

mm-ADT is an early stage project. This 0.1-alpha version of the specification is being publicly released in the hopes of receiving feedback on the technical aspects of mm-ADT as well as the social aspects as we search for an appropriate venue to continue its development. There are various design hurdles that are difficult to overcome using pure reason alone. While the specification is being reviewed, the team will transition to prototyping an mm-ADT-bc compiler. The specification roadmap provides an overview of the known issues that need to be addressed as the project moves forward.

mm-ADT is currently being funded by DataStax and RReduX.

Introduction

mm adt logo

An abstract data type (ADT) is a mathematical model for data types, where a data type is defined by its behavior (semantics) from the point of view of a user of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations. This contrasts with data structures, which are concrete representations of data, and are the point of view of an implementer, not a user.

— Wikipedia Article on Abstract Data Type

This specification defines a multi-model abstract data type (mm-ADT) capable of representing the logical structure and idiomatic processes of common database data types (model-ADT) such as relational tables, n-ary property graphs, RDF statements, document trees, wide-column tuples, key/value pairs, and any hybrid or novel ADT to come.

mm-ADT’s data model and operations encapsulate the many structural and procedural patterns common to all databases. Auxiliary structures such as indices, denormalized data locations, sort orders, and schemas are implicitly exposed to the optimizer (at compile-time) and the processor (at runtime) via data access paths that exist outside the model-ADT’s primary, logical structure. By abstracting away the physical encoding and supporting optimizations, query language compilers and processing engines are able to benefit from the custom features of the underlying storage system all the while remaining agnostic to their existence. mm-ADT provides a common integration point for storage systems, processing engines, and query languages to effectively commingle, regardless of their respective model-ADT orientations.

NoSQL Databases

At the turn of the millennium, the database space was primarily populated by SQL-driven relational databases. The millennium’s second decade saw an explosion of open source database technology focused on providing distributed, fault tolerant storage at the expense of simpler query and transactional semantics. "NoSQL" databases abandoned the expressive, general-purpose relational-ADT in favor of new ADTs tailored to satisfy particular application requirements.

  • Graph Database: Relational database rows are coupled at query-time via an explicit JOIN-operation. Graphs are "pre-joined" structures. In the graph-ADT, vertices are coupled via edges and graph-oriented query languages, capitalizing on the a priori existence of these links, provide users a concise traversal and pattern-matching syntax not available in relational-ADT oriented languages such as SQL.

  • Key/Value Store: Relational databases are bound to the relational-ADT and the real-world limitations created by its expressiveness. If an application only requires simple PUT and GET access to a massive amount of data indexed by unique keys, then while a relational query language can support the query requirements, a relational storage system may not support the scaling requirements. Key/value databases are easier to scale because the key/value-ADT, with its logically partitioned data model and basic operations, naturally support the data’s physical distribution across a multi-machine compute cluster.

  • Document Database: The key/value-ADT does not maintain operations for the internal analysis of the value data. If the value’s data can be represented as a tree-like "document" then a document database can be used. The document-ADT extends the key/value-ADT with structured values (i.e. documents), value-aware indices, and respective query operations capable of traversing the internal structure of the values.

  • RDF Store: Long before the NoSQL database movement, the W3C developed a web-oriented data model called the Resource Description Framework (RDF). The RDF-ADT is structured as a single unordered collection of statements (triples or quads). Each statement has a subject, predicate, and object component, where complex statement patterns are queried using a pattern-matching language called SPARQL. Similar to the aforementioned graph query languages, SPARQL does not have a JOIN operator as RDF statements are implicitly linked when any of their statement components are equal.

  • Wide-Column Store: The wide column-ADT is generally understood to be a relational-ADT with a relaxed data model and no JOIN-operation. Rows are organized into tables, however, rows of the same table aren’t required to have the same structure (i.e. tables are not required to be nxm-matrices). In fact, rows within the same table can have different lengths, where in practice, multi-million column rows are used for data denormalization and the creation of endogenous, ADT-encoded indices.

Multi-Model Databases

To date, databases remain primarily classified by their ADT. While database ADTs are strongly influenced by the storage system’s physical structure (both on-disk and in-memory) and the supported operational primitives of the processing engine, it is ultimately the lexicon and prescribed data-perspective of the query language that solidifies the database’s category. At the level of the processing engine and storage system, many superficial qualities distinguish tables/rows, vertices/edges, statements, and documents. In fact, many database processing engines and storage systems have features that are generally useful across ADTs such as processors supporting stream-oriented parallel and distributed computing with an extensible set of functional primitives and storage systems supporting distribution, fault-tolerance, replication, caching, indexing, and transactions.

The multi-model database trend is founded on the realization that with a sufficiently complex database ADT, multiple model-ADTs can be embedded and when doing so, concepts such as a row, a document, and a vertex lose their meaning as rows and documents can reference each other and vertices and edges can be joined, selected, and projected. However, for multi-model database engineers, while it is important to understand the data model and operations of their target-ADTs, it is equally important to understand the optimization techniques and query idioms used in practice by those databases designed specifically for those model-ADTs. Ultimately, a multi-model database will come to support various indexing, caching, and denormalization techniques, different serial and random access threaded and distributed processors, and either multiple ADT-specific query languages or a single query language that can naturally express all their supported model-ADTs’ idioms. At the limit, when storage system representations, processing engine features, and query language syntaxes are sufficiently malleable to effectively and efficiently structure and process any known ADT instance, the database industry may transition out of its variegated "Tower of Babel"-existence into the world of universal, general-purpose computing much like the hardware and programming language industries did so many decades before.

Intended Audience

This specification presents a database virtual machine architecture. An mm-ADT virtual machine is programmed by a query language and evaluated by a processing engine against data persisted in a storage system. However, this specification can also be used for the intermediate case of query language translation.

use cases
Query Language Translation

The first use case translates a query that is oriented for processing an X model-ADT into a query that is oriented for processing a Y model-ADT via a process called embedding. For example, a graph database query can be translated to a relational database query, where the graph structure is realized as a foreign key-based, first normal form relational structure with JOIN operations replacing traversal operations. model-ADT embedding is not always possible at the query translation-level as the Y model-ADT may lack the expressivity required to capture the semantics of the X model-ADT query. For universal multi-model support, an mm-ADT database virtual machine can be implemented.

Database Virtual Machine

The second use case is for database architectures that have loosely coupled language compilers, processing engines, and storage systems. A query is compiled to language-agnostic mm-ADT bytecode. That bytecode represents a computation aimed at the primary, logical structure of the language’s intended model-ADT. If the language’s model-ADT is different from the storage system’s model-ADT, the bytecode undergoes a translation process defined by an embedding. A processor then executes the resultant bytecode using mm-ADT objects which provide bi-directional communication with the storage system. Database architectures leveraging an mm-ADT virtual machine allow any query language to program any processor to compute over any storage system, where the distinction between implementations lies in the time/space trade-offs made in the respective components' internal designs.

Given these two use cases, the ideas and architecture captured herein may prove useful to the following groups.

  • Database providers: The creators of database products develop and maintain a diverse array of technologies from query to storage. mm-ADT fosters a looser coupling between these technologies. The second part of this specification discusses the secondary structures that are typically used to optimize queries in the respective model-ADT community. mm-ADT provides an ADT agnostic means of leveraging these optimizations. As such, mm-ADT compliant database providers can widen the scope of their product’s offering with new model-ADT’s and respective query languages.

  • Open source database technologists: The open source community has an outstanding number of quality projects that are being integrated into database products at an increasing rate. Examples include batch and stream processors, query language frameworks, distributed storage systems, general-purpose indexing engines, and the like. mm-ADT makes it possible to integrate these components in a standardized manner, thus enabling the proliferation of novel database systems and ADTs.

  • Database enthusiasts: Database technology is influenced by numerous core computer science sub-domains such as networking, hardware, data structures, distribution, compilers, etc. mm-ADT provides a single conceptual and computational framework that integrates many of these database considerations and, unlike most database textbooks, this specification is not focused on the relational model and the particulars of its encoding, processing, and optimization techniques.

mm-ADT Structures and Processes

Most database management systems are organized around a single data model that determines how data can be organized, stored, and manipulated. In contrast, a multi-model database is designed to support multiple data models against a single, integrated backend.

— Wikipedia Article on Multi-Model Database

The presented multi-model abstract data type (mm-ADT) facilitates the embedding of common database ADTs (model-ADTs) into an mm-ADT compliant data storage system to be manipulated by an mm-ADT compliant data query language whose submitted queries are ultimately evaluated by an mm-ADT compliant data processing engine.

  • Storage System (storage-ADT): provides persistence, distribution, transaction, and indexing features.

  • Processing Engine (processor-ADT): provides query evaluation, distribution, workload management features.

  • Query Language (language-ADT): provides query syntax features and compilation.

lang process storage

Every database component is designed according to a particular model-ADT. For instance, a storage system may have a logical and physical representation optimized for processing relational data. Similarly, a query language may be designed to describe processes over a graph. A graph-based query language is unable to direct processes over a relational storage system because there exists an impedence mismatch between the two models. The role of mm-ADT is to provide both compile- and run-time mappings that bridge disparate models in order for query languages to work with storage systems regardless of their respective model-ADTs and moreover, with existing data.

model operations

mm-ADT is not only able to losslessly capture a model-ADT’s logical, primary structure, it also captures secondary data access paths necessary for efficient processing. The following diagram details the relationship between the various components of an mm-ADT database architecture. The concepts presented will be discussed at length over the course of this specification.

system architecture

Before proceeding, an example demonstrating the two previously discussed intended mm-ADT use cases is provided. The source code below is written in a textual bytecode language called mm-ADT-bc. This code defines a particular (naïve) embedding of a property graph-ADT into a relational-ADT. The defined types have various instruction re-write rules (-> x => y) that specify how to read/write vertices and edges to and from a relational structure composed of tables and rows. Rewrites also contain instructions for accessing secondary structures such as table indices.

mm-ADT-bc is a low-level bytecode language similar in nature to a machine assembly language. It is the role of high-level query language designers to develop mm-ADT bytecode compilers for their languages. mm-ADT bytecode can be consumed by any mm-ADT compliant processor and storage system.
A property graph-ADT embedded in a relational-ADT
[model,rdb=>mm,                                                                     (1)
  [define,row,[(@string:@bool|@number|@string)*]]
  [define,table,row*]
  [define,db,[(@string:@table)*]]]

[model,pg=>rdb,                                                                     (2)
  [define,vertex[id:@long~,
                 label:@string~,
                 name:@string~,
                 age:@int~,
                 inE:@edge*,
                 outE:@edge*] =>
    (row[id:$0,label:$1,name:$2,age:$3] => [[db][get,V][has,id,eq,$0]])             (3)
      -> [get,outE] => ((row* => [[db][get,E]]) => [has,outV,eq,$_0])               (4)
      -> [get,inE]  => ((row* => [[db][get,E]]) => [has,inV,eq,$_0])
      -> [drop]     => [[db][get,E]                                                 (5)
                            [or,[has,outV,eq,$_0],
                                [has, inV,eq,$_0]]
                            [drop]
                        [db][get,V]
                            [has,id,eq,$_0]
                            [drop]]],
  [define,edge[id:@long~,
               label:@string~,
               weight:@float~,
               outV:@vertex[id:@long~],
               inV:@vertex[id:@long~]] =>                                           (6)
    (row[id:$0,label:$1,weight:$2,outV:$3,inV:$4] => [[db][get,E][has,id,eq,$0]])   (7)
      -> [put,(outV|inV),@vertex] => [error]],
  [define,db,vertex* => [[db][get,V]],
    -> [new] => [[db]                                                               (8)
                 [put,V,table<row[id:@long,
                                  label:@string,
                                  name:@string,
                                  age:@int]*>   => [[db][get,V]]
                   -> [has,id,eq,@long~]        => row[id:$0]?]                     (9)
                 [put,E,table<row[id:@long,
                                  label:@string,
                                  weight:@float,
                                  outV:@long,
                                  inV:@long]*>  => [[db][get,E]]
                   -> [has,id,eq,@long~]        => row[id:$0]?                      (10)
                   -> [has,outV,eq,@long~]      => (row[outV:$0]*
                     -> [has,label,eq,@string~] =>  row[outV:$_0,label:$0]*)        (11)
                   -> [has,inV,eq,@long~]       => (row[inV:$0]*
                     -> [has,label,eq,@string~] =>  row[inV:$_0,label:$0]*)]]]]     (12)
1 A relational-ADT embedded in mm-ADT (the storage/target-ADT).
2 A property graph-ADT embedded in the relational-ADT (the language/source-ADT).
3 A vertex is encoded as a row in a table called V.
4 Vertex edges are accessible via a table called E.
5 When a vertex is deleted, all its edges are deleted from E.
6 An edge’s incident vertices are accessed via a foreign key to the V table.
7 An edge is encoded as a row in a table called E in first normal form.
8 The property graph-ADT db is initialized with two relational-ADT tables.
9 The V table has a unique index on id (specified unique by ?).
10 The E table has a unique index on id (specified unique by ?).
11 The E table has a composite index on outV and label.
12 The E table has a composite index on inV and label.
The property graph model-ADT presented above is degenerate. The embedding is not generalized to support schema-less graphs as relational-ADT tables require a predefined schema. To keep the example simple, only two tables (V and E) were created and thus, all vertices and edges have a fixed property set. A more complete property graph-ADT model is provided in the property graph model-ADT specification.

The Gremlin graph traversal below creates two vertices and connects them with a knows-edge.

Gremlin traversal creating and connecting two vertices by a knows-edge
g.addV('person').property(id,1).
                .property('name','marko')
                .property('age',29)
  .addE('knows').property(id,7)
                .propery('weight',0.5)
                .to(addV('person').property(id,2)
                                  .property('name','vadas')
                                  .property('age',27))

An mm-ADT compiler for Gremlin would generate the following mm-ADT bytecode. Note that this bytecode is oriented towards the primary, logical structure of a property graph.

Gremlin traversal compiled to property graph-ADT bytecode
[db][put,vertex[id:1,label:person,name:marko,age:29,inE:<>,outE:<>]]     (1)
    [put,vertex[id:2,label:person,name:vadas,age:27,inE:<>,outE:<>]]     (2)
    [obj,vertex[id:1]]                                                   (3)
    [get,outE]                                                           (4)
    [put,edge[id:7,label:knows,weight:0.5f,
              outV:vertex[id:1],inV:vertex[id:2]]]                       (5)
1 A vertex is added to the storage system.
2 Another vertex is added to the storage system.
3 A reference to the first vertex is generated.
4 A reference to the first vertex’s outgoing edges is generated.
5 A knows-edge from the first to the second vertex is added to the first vertex’s outgoing edges.

In order to execute the above bytecode over a relational storage system, the previously defined embedding provides a mechanical procedure for translating property graph bytecode to relational bytecode.

Property graph-ADT bytecode translated to relational-ADT bytecode via embedding
[db][get,V]                                                 (1)
    [put,row[id:1,label:person,name:marko,age:29]]          (2)
    [put,row[id:2,label:person,name:vadas,age:27]]          (3)
//  [obj,row[id:1] => [[db][get,V][has,id,eq,1]]]           (4)
[db][get,E]                                                 (5)
    [put,row[id:7,label:knows,weight:0.5f,outV:1,inV:2]]    (6)
1 The V table is referenced.
2 A "vertex"-row is added to the V table.
3 Another "vertex"-row is added to the V table.
4 A reference to the first row is generated, but is superfluous as it is never dereferenced.
5 The E table is referenced.
6 An "edge"-row is added to the E table.

If the relational storage system has an SQL processor, then instead of using an mm-ADT processor, the above bytecode can be translated to the corresponding SQL query below. Note the structural similarities between the relational bytecode and the SQL query.

Relational-ADT bytecode translated to SQL
INSERT INTO V
  (id,label,name,age)
VALUES
  (1,'person','marko',29),
  (2,'person','vadas',27);
INSERT INTO E
  (id,label,weight,outV,inV)
VALUES
  (7,'knows',0.5,1,2);
pg rdb

The Gremlin graph traversal below determines the average age of the people that vertex 1 knows.

Gremlin traversal for determining the average age of the vertices known by vertex 1
g.V(1).out('knows').values('age').mean()

A compiler from Gremlin to mm-ADT would generate the following property graph bytecode.

Gremlin traversal compiled to property graph-ADT bytecode
[db][has,id,eq,1]
    [get,outE]
    [has,label,eq,knows]
    [get,inV]
    [get,age]
    [mean]

The above bytecode is oriented towards a property graph data structure. In order to leverage the graph representation embedded in the relational structure, the previously defined embedding enables the following translation.

Property graph-ADT bytecode translated to relational-ADT bytecode via embedding
[db][get,V]
    [has,id,eq,1]                        (1)
    [at,[[db][get,E]],
        [[has,outV,eq,![get,id]]         (2) (3)
         [has,label,eq,knows]            (4)
         [at,[[db][get,V]],
             [[has,id,eq,![get,inV]]     (5) (6)
              [get,age]                  (7)
              [mean]]]]]
1 This references an index lookup on table V (based on id).
2 This aborts the previous index lookup on table V (id is known).
3 This references an index lookup on table E (based on outV).
4 This references a composite index lookup on table E (based on outV and label).
5 This triggers the previous composite index lookup on table E (inV is unknown).
6 This references an index lookup on table V (based on id).
7 This triggers the previous index lookup on table V (age is unknown).
The [at] instruction implements a nested loop join. For each incoming record, the [at]-stream is scanned and processed by the subsequent [has]-filters (i.e. the join predicates). In the above bytecode example, where indices are defined, serial scans are avoided and an index nested loop join is enacted.

The relational bytecode can be used by a processor to query the storage system or, if the storage system supports SQL, the bytecode can be translated to the SQL query below and submitted as such. Both approaches allow relational database users to query their data with a graph traversal language.

Relational-ADT bytecode translated to SQL
SELECT AVG(age) FROM V
  WHERE id=(SELECT inV FROM E WHERE outV=1 AND label='knows')
pg rdb 2

Structures

A structural type system (or property-based type system) is a major class of type system, in which type compatibility and equivalence are determined by the type’s actual structure or definition, and not by other characteristics such as its name or place of declaration.

— Wikipedia Article on Structural Type System

object type instance Every denotable entity in mm-ADT is an object. Every object is a pattern capable of being used in pattern matching. Objects that match other objects are called types. Objects that match themselves are called instances. Instances lacking their full value, but with enough data to generate themselves (e.g. a primary key) are called references. The meta-structural aspects of types, instances, and references are specified in the table below for an example "person" record.

Table 1. The meta-structural components of an mm-ADT object
kind symbol (?) type value data access path

type

person

type

[name:@string,age:@int]

[[db][has,name,eq,$0]]

instance

person

[name:marko,age:29]

[[db][has,name,eq,marko]]

reference

person

[name:marko,age:@int]

[[db][has,name,eq,marko]]

An object can have a denoting symbol. The symbol must be respected by all processes. For this reason, typically only types have symbols as propagating symbols in a distributed environment is costly. Every object has a type. A type’s type is type. Instances and references have types specified by the type’s denoting symbol, or for anonymous types (discussed later), the type object itself. Instance and reference types can change over the course of a computation. Every object has a value which is its internal structure. Finally, every object has a data access path that specifies the location of the canonical representation of that object. If the object is in local memory and not associated with a storage system, then its data access path is itself.

In mm-ADT, types are best conceptualized as degenerate instances and references are best conceptualized as lazy instances.

Objects are created using the built-in, fundamental mm-ADT data types diagrammed below, where the directed arrows denote a subtype of-relationship. Every built-in type is structureless in that is has no meta-structural value (see previous table). All custom types have structural definitions captured by their value. The built-in type hierarchy is the only explicit type hierarchy in mm-ADT. All custom types exist in an implicit, structural type graph, where instances are realized as different types depending on the model-ADT context.

type system

The table below presents the tokens of the mm-ADT-bc language used to write mm-ADT bytecode.

Table 2. The tokens of the mm-ADT-bc language
token example description

@

@string

An instance of type string (short for typeof(string))

!

![put,age,29]

Insert the field age:29 (execute bytecode at runtime)

!!

!![op,add,1]

Add 1 to the incoming number (execute bytecode at compile time)

x(y)

lt(40)

A number less than 40 (execute predicate operator)

&

@int&gt(32)

An integer greater than 32 (short for and(x,y))

|

'marko'|'vadas'

Either a 'marko' or 'vadas' string (short for or(x,y))

{ }

@float{2}

Two 32-bit floating point numbers (specific quantifier)

{ , }

@string{2,5}

Two to five strings (range quantifier)

{, }

@int{,2}

Up to two integers (an upper bound quantifier)

{ ,}

@int{2,}

Two or more integers (a lower bound quantifier)

*

@string*

Zero or more strings (short for {0,})

?

@int?

Zero or one integers (short for {0,1})

+

@string+

One or more strings (short for {1,})

( )

(@string|@int)*

Zero or more strings or integers (denotes pattern grouping)

[ ]

[@string*]

A list of zero or more strings (denotes list)

[:]

[(@string:@int){2}]

A record of two string/int pairs (denotes record)

< >

<1.23,0.2>

A stream of 1.23 and 0.2 (denotes stream)

~

~x

Store the bound value to the variable x

~

Store the bound value based on relative position

$

$x

The value bound to the variable x

$2

The 3rd value bound by position

$_2

The 3rd value bound by position in the previous pattern

$_

The root object of the pattern (analogous to this)

->

-> [hasKey,@string]

An instruction with a string argument (denotes instruction)

=>

=> 42.0d

A mapping to 42.0 double (denotes a mapping)

//

// a comment

A comment lasting until the end of the line

/**/

/* a big comment */

A multi-line comment block

A Short Introduction to Bytecode

mm-ADT bytecode is discussed at length in the processes section of this specification. For the concepts presented in this section, it is necessary to have at least a rudimentary understanding of the way in which computations are expressed and executed in mm-ADT.

mm-ADT processing is founded on stream computing principles. A stream is an ever expanding and contracting sequence of objects. Streams are processed by functions. A stream function has an incoming stream and an outgoing stream. A stream function maps objects from its input stream to objects in its output stream. If functions are denoted by vertices and streams by edges, then an mm-ADT computation is structured as a directed (semi)-acyclic graph. Such process graphs are defined using mm-ADT bytecode. mm-ADT bytecode has a textual representation called mm-ADT-bc, where the process graph-structure is encoded using S-expressions.

The mm-ADT-bc source below defines a stream with three instructions (functions). The [branch] instruction has two bytecode arguments (nested sub-streams) each containing one instruction. A stream is denoted < > and the state of the stream after each instruction is specified in the associated comment.

process graph
[[start,1,2,3],    // <1,2,3>
 [is,gt,1],        // <2,3>
 [branch,
   [fold],         // <[2,3]>
   [op,add,7]      // <9,10>
]]                 // <[2,3],9,10>

mm-ADT-bc parsers ignore whitespace and commas between instructions. Thus, the above bytecode can also be written in a linear form.

[[start,1,2,3][is,gt,1][branch,[fold],[op,add,7]]]

Primitives

Primitives form the atomic datum of mm-ADT. A primitive object’s type is known by name only. The table below describes each non-operator primitive type. The regular expression of each type defines a finite state machine capable of recognizing a primitive object written in mm-ADT-bc.

type regular expression example description

bool

true|false

true

A true or false boolean

bytes

An 8-bit byte sequence

int

[0-9]+(I|i)?

6i

A 32-bit signed integer

long

[0-9]+(L|l)

42L

A 64-bit signed integer

float

(\.[0-9])?(F|f)?

92.65F

A 32-bit floating point

double

(\.[0-9])?(D|d)

101.5d

A 64-bit floating point

string

'stephen'

A character sequence

In mm-ADT-bc, the types string, int, and float have an optional abbreviated syntax. Every unspaced character sequence not in the compiler’s symbol table is typed as a string. Any character sequence composed of only numeric characters is typed as an int. Finally, any character sequence composed of numeric characters and a single period character is typed as a float.
Operators

Operators denote simple functions associated with primitive object manipulation. The mm-ADT operators are organized into three categories: predicate, arithmetic, and string.

type constants description

pred

typeof

Binary predicate functions from objects to bool

rel

eq neq gt lt gte lte
within between

Relational functions from objects to bool

arith

add sub mult div pow mod

Binary arithmetic functions from numbers to number

str

len split regex

n-ary string functions

Sequences

A sequence is a linear chain of objects. Sequences can be structured or unstructured. There are two built-in structured sequences: lists and records. Structured sequences provide random access to their elements. An unstructured sequence is called a stream. The elements of an unstructured sequence are sequentially accessed via iteration. Like the fundamental primitives, the built-in sequences have nominal types while their custom derivations are structurally typed.

type value description

list

[@object*]

An ordered sequence of objects accessed by position

record

[(@object:@object)*]

An unordered sequence of pairs accessed by key

stream

@object*

An unordered sequence of objects accessed by iteration

Lists

A list is a sequence indexed by an integer (int or long). A list can contain heterogeneously typed objects. An example list is provided below, where 'josh' could have been written as josh given mm-ADT-bc’s abbreviated syntax for strings.

[42,'josh',42,true]

The mm-ADT bytecode instructions for list-manipulation are provided in the table below and followed by example bytecode using each instruction.

instruction domain range description

[get,@integer]

list

object

Retrieve the object at the specified index

[put,@integer,@object]

list

list

Replace the object at the specified index

[fit,@integer,@object]

list

list

Insert the object at the specified index

[fit,@object]

list

list

Append the object to the end of the list

[drop,@integer]

list

list

Remove the object at the specified index

[start,[42,josh,42,true]]   // [42,josh,42,true]
[obj,[![get,0],![get,3]]]   // [42,true] (a new list created)
[put,1,42]                  // [42,42]
[fit,1,josh]                // [42,josh,42]
[fit,1,josh]                // [42,josh,josh,42]
[fit,true]                  // [42,josh,josh,42,true]
[drop,1]                    // [42,josh,42,true]
Initial instructions such as [start] have no incoming stream. When a computation is initiated, initial instructions emit objects to their outgoing stream. All streams prior to an initial instruction are evaluated, but their output streams are ignored. Thus, the only useful prior streams are those that yield side-effects. All side-effect instructions such as [put], [fit], and [drop] take an object from the incoming stream, mutate it, and then place it on the outgoing stream.
Records

A record is an unordered sequence of key/value pairs called fields. The keys of a record are unique. The example record below has two fields. Each field has a string key (name and age) with respective string and int values (marko and 29).

[name:marko,age:29]

The mm-ADT bytecode instructions for record-manipulation are provided in the table below and then followed by an mm-ADT-bc example demonstrating their use.

instruction domain range description

[get,@object]

record

object

Retrieve the value at the specified key

[put,@object,@object]

record

record

Insert the specified key/value (replacement possible)

[drop,@object]

record

record

Remove the field with the specified key

[start,[name:marko,age:29]]           // [name:marko,age:29]
[obj,[name:![get,name],
      age: ![[get,age][op,add,1]]]    // [name:marko,age:30] (a new record created)
[put,age,29]                          // [name:marko,age:29]
[put,wallet,[5,10,10]]                // [name:marko,age:29,wallet:[5,10,10]]
[drop,wallet]                         // [name:marko,age:29]
The ! token states that the associated list should be interpreted as bytecode. If the [obj] instruction, in the example above, did not contain ! tokens, then the output object would have been [name:[get,name],age:[[get,age][op,add,1]]]. Whenever there is ambiguity as to whether the argument should be treated as a list or as bytecode, mm-ADT-bc will assume list. Therefore, if the argument should be treated as bytecode, the list should be prefixed with !.
Streams

A stream is an unbounded, structureless sequence of objects. Streams are fundamental to mm-ADT processing. Bytecode instructions operate on streams of data. Thus, every instruction is ultimately a stream instruction. Examples are presented below.

[start,42]                      // <42>
[op,sub,13]                     // <29>
[obj,[name:marko,age:![_]]]     // <[name:marko,age:29]>
[branch,[get,name],[get,age]]   // <marko,29>
[is,typeof,string]              // <marko>
[op,len]                        // <5>
[is,gt,6]                       // < >

Streams are lazy evaluation objects. They do not contain objects, but provide access to an object sequence. Thus, they act as references as opposed to instances. The significance of this distinction will be made apparent when discussing the role of references in bytecode evaluation (i.e. query processing). A collection of stream behaviors are enumerated below.

  1. < > is used to denote a stream.

    • <1,2,3> is a stream of the integers 1, 2, and 3.

  2. Every object is implicitly in a stream.

    • <stephen> is processed equivalently as stephen.

  3. Streams can not be nested.

    • <43,<21,12>,3> is equivalent to <43,21,12,3>.

  4. Streams are not generally matched by @object.

    • <marko> is matched by @object.

    • <marko,kuppitz,42> is matched by @object{3}.

A stream can have objects added to it and objects removed from it.

instruction domain range description

[put,@object]

stream

stream

Insert the object into the stream reference.

[drop]

object

Remove the instance at its data access path (assuming in stream).

Custom Types

type syntax An mm-ADT type is an object that describes other objects. The diagram on the left details the syntactical components of an mm-ADT-bc type. Every type definition starts with a type value describing the structure of an instance using bound and unbound values. The component to the right of the mapsto token (=>) specifies the data access path used to refer to the canonical representation of an instance of that type (typically in a storage system). Types can have zero or more instructions. This is analogous to object oriented programming, where a class can have defined methods. However, unlike object oriented programming, the instructions are always from the predefined mm-ADT instruction set and constrained to instructions of the built-in super-type. Type instructions specify rewrite rules for how to achieve the matching instruction’s semantics given the respective model-ADT’s embedding. Type instructions are prefixed with a step token (->). If the processor’s current instruction matches a type instruction of the object being processed, the result to the right of the mapsto token (=>) is merged into the instruction’s outgoing stream. A result can be bytecode (a reference to a stream of objects) or a type (in which case, it is the responsibility of the storage system to manifest a matching instance — typically as a reference).

The type value is the only component required when defining a type.

There are two sorts of types: named types and anonymous types.

  • Named type: any type with a corresponding symbol. The symbol serves as a named-constant.

  • Anonymous type: any type that does not have a corresponding symbol.

A named type is created using the [define] instruction. The following named person type has an anonymous int subtype for the age value.

[define,person,[name:@string,age:@int<between(0,150)>]]

The [define] instruction takes a db object as input. The db object is a globally recognized object denoting the mm-ADT machine. In distributed environments, every node contains a db object. When a type is defined, the definition is synchronized across all db objects before the computation proceeds. As such, named type are typically not defined by queries, but when an mm-ADT system is initialized.

Custom Primitives

mm-ADT primitives can be extended by creating subtypes. A probability type is defined below as a subtype of double. Everything that is a probability is also a double, but everything that is a double is not a probability.

[define,probability,double<between(0.0,1.0)>]
Primitive object types and instances are created with stream notation, where no more than one object can exist in < >. While semantically valid, < > is necessary as a delimiter when parsing.

The newly minted probability can be used in other pattern definitions. For example, the type likely is a subtype of probability.

[define,likely,probability<gte(0.85)>]

Common database and programming language types can be defined in mm-ADT. The source code below specifies SQL’s varchar, short integers, and complex numbers with add and mult operators defined. Note that complex is not a primitive type, but a subtype of list.

[define,varchar255,string<![[op,len][is,lte,255]]>]
[define,short,integer<between(-32767,32767)>]
[define,complex,[@double~x1,@double~x2]
  -> [op,add, complex[@double~y1,@double~y2]] => complex[add($x1,$y1),add($x2,$y2)]]
  -> [op,mult,complex[@double~y1,@double~y2]] => complex[sub(mult($x1,$y1),mult($x2,$y2)),
                                                         add(mult($x1,$y2),mult($x2,$y1))]]
Custom Sequences

If a model-ADT has a data type that is "sequence-like," but requires particular constraints on what can be contained therein, a custom sequence type can be defined. For example, tuples can be defined as a subtype of list.

[define,pair,[@object{2}]]
[define,triple,[@object{3}]]
[define,quad,[@object{4}]]
[define,person_tuple,[@string,@int]]

A relational database row is "record-like" in that the keys are column names and the values are entries. However, unlike an mm-ADT record, a relational row can only have string keys (column names) with values that are either strings, numbers, or bools (row entries). A simple relational data model in first normal form is defined below where rows are records with primitive values and tables are streams of rows.

[define,row,[(@string:(@string|@number|@bool))*]]
[define,table,row*]

The above relational types can be subtyped to create domain-specific schemas. For instance, a people-table may contain person-rows. A person is a subtype of row with only two fields: a string name and an int age. Similarly, people is a subtype of table containing only person objects.

[define,person,row[name:@string,age:@int]]
[define,people,table<@person*>]
[start,people<person[name:marko,age:29],
              person[name:vadas,age:27],
              person[name:josh,age:32],
              person[name:peter,age:35]>]   // a people stream containing 4 persons

If a sequence is created with a type prefix, then the instance must always match the type pattern (at creation and any subsequent mutations). However, it is important to note that the type is not stored with the instance. When an instance is created, its type is the specified type. When an existing instance is retrieved, its type is determined by the model embedding. For example, if a data location is accessed as a table, then row objects are streamed. However, if the same data location is accessed as people, then person objects are streamed.

Type and Instance Derivations

The mm-ADT-bc syntax for creating a type is same as the syntax for creating an instance. The distinction between a type and an instance is whether the object’s value contains unbound values. The first source block below demonstrates the mechanical procedure for deriving a type value from a type definition. Note that the lines following each [define] are demonstrating the derivation and, in this context, are not valid mm-ADT-bc expressions.

Type value derivations
// a person is a record with a string name and an int age
[define,person,[name:@string,age:@int]]
  => [name:@string,age:@int]
  => [![is,eq,name]:![is,typeof,string],![is,eq,age]:![is,typeof,int]]

// a worker is a person that also has a string company
[define,worker,person[company:@string]]
  => [name:@string,age:@int]&[company:@string]
  => [name:@string,age:@int,company:@string]

// a child is a person whose age is less than 10
[define,child,person[age:lt(10)]]
  => [name:@string,age:@int]&[age:lt(10)]
  => [name:@string,age:@int&lt(10)]

// a slave is a child whose is also a worker
[define,slave,child&worker]
  => [name:@string,age:@int&lt(10)]&[name:@string,age:@int,company:@string]
  => [name:@string&@string,age:@int&lt(10)&@int,company:@string]
  => [name:@string,age:@int&lt(10),company:@string]

// a lowly is a child or a worker
[define,lowly,child|worker]
 => [name:@string,age:@int&lt(10)]|[name:@string,age:@int,company:@string]
 => [name:@string|@string,age:@int&lt(10)|(age:@int,company:@string)]
 => [name:@string,age:@int&lt(10)|(age:@int,company:@string)]

The source block below demonstrates the creation of instances. Note that in those situations where unbound values can not be bound, a type is created.

Instance value derivations
person[name:marko,age:29]
 => [name:@string,age:@int]&[name:marko,age:29]
 => [name:@string&marko,age:@int&29]
 => [name:marko,age:29]
 => [![is,eq,name]:![is,eq,marko],
     ![is,eq,age]:![is,eq,29]]                                   // instance

person[name:marko]
 => [name:@string,age:@int]&[name:marko]
 => [name:@string&marko,age:@int]
 => [name:marko,age:@int]                                        // type

worker[name:marko,age:29]
 => [name:@string,age:@int,company:@string]&[name:marko,age:29]
 => [name:@string&marko,age:@int&29,company:@string]
 => [name:marko,age:29,company:@string]                          // type

lowly[name:marko,age:29]
 => [name:@string,age:@int&lt(10)|(age:@int,company:@string)]
     &[name:marko,age:29]
 => [name:@string&marko,
      age:@int&lt(10)&29
     |(age:@int&29,company:@string)]
 => [name:@string&marko,
      age:lt(10)&29                    // <- contradiction
     |(age:29,company:@string)]
 => [name:@string&marko,
     (age:29,company:@string)]
 => [name:marko,age:29,company:@string]                          // type

lowly[name:little-timmy,age:4]
 => [name:@string,age:@int&lt(10)|(age:@int,company:@string)]
     &[name:little-timmy,age:4]
 => [name:@string&little-timmy,
      age:@int&lt(10)&4
     |(age:@int&4,company:@string)]
 => [name:@string&little-timmy,
      age:4                             // <- binding branch
     |(age:4,company:@string)]
 => [name:little-timmy,age:4]                                    // instance

References

Data access is an expensive operation and should be avoided if possible. Unlike typical virtual machine environments (e.g. the JVM), where data is accessible via the physical machine’s random access memory, storage systems (the shared memory system of an mm-ADT machine) encode their data throughout the memory hierarchy and across the network (see data access costs). As such, object values should only be accessed when absolutely necessary. Objects that don’t contain all their values are called references. References serve as proxies to their referent objects. At minimum, a reference must contain enough information to manifest the referent’s data using the associated data access path. Accessing the referent object is called dereferencing. An example person type, instance, and reference are presented below.

person[name:@string~,age:@int] => [[db][has,name,eq,$0]]   // a person type
person[name:marko,age:29]                                  // a person instance
person[name:marko]                                         // a person reference
reference types instances
All mm-ADT optimizations are framed as techniques for avoiding dereferencing or for leveraging spatial and temporal reference information to ensure efficient dereferencing.
Dereferencing

A reference must have enough information to manifest its referent. If not, the reference is a type. For example, given the following person definition, where no data access path is defined, the final instruction does not create a reference, but a subtype of person. That is, it matches all persons named marko regardless of their age.

[define,person,[name:@string,age:@int]]    // a person type
[start,person[name:marko,age:29]]          // a person instance    [name:marko,age:29]
[start,person[name:marko]]                 // a sub-type of person [name:marko,age:@int]

If a type has a data access path, then a reference can be created. The reference must be sufficiently populated to ensure that the data access path forms a proper instance (i.e. the path itself does not contain unbound values).

[define,person,[name:@string~,age:@int] => [[db][has,name,eq,$0]]]   // a person type
[start,person[name:marko]]                                           // a person reference
[get,age]                                                            // an age reference

Since age is not specified in the person[name:marko] reference instance, age is accessed via the concatenation of the data access path and [get,age]. The evaluation of the data access path is called dereferencing.

person[name:marko]                                          // a person reference
person[name:marko,age:![[db][has,name,eq,marko][get,age]]]  // implicit representation
The data access path is the canonical means of accessing the referent via the model-ADT’s primary structure. However, in practice, depending on where the referent is stored (memory, disk, network, etc.) and how the respective mm-ADT object is linked to the storage system, the data access path may not be explicitly used.
Optimizations

References have a significant role in mm-ADT optimization and are discussed at length after mm-ADT bytecode is presented. In the meantime, a simple example is provided.

[define,person,[name:@string,age:@int]]
[define,people,person*
                 -> [has,age,@rel~,@int~] => person[age:$0($1)]*]
[put,people => [db]]

In the mm-ADT-bc source above, people is defined as a (sub)type of stream containing zero or more persons. The people type has an age-based [has] instruction pattern. If the processor invokes a matching [has] on a people instance, an anonymous stream object is returned that is guaranteed to contain zero or more persons with the respective age. This instruction definition is requesting that the storage system produce a direct path from stream to sub-stream in order to subvert the materialization of all the person instances in people. This is how database indices are modeled in mm-ADT.

It is a requirement of the storage system to provide a sublinear time access route to those persons of a particular age. If the storage system does not have the requisite secondary structures to support this "jump," then an error should be reported at [define]. In which case, a linear "scan and filter" is enacted by the processor. In other words, the processor performs the computation over the primary structure which entails iterating the entire person stream.

The person[age:$0($1)]* is a dynamically created anonymous type whose data access path is the source reference data access path concatenated with the yielding instruction. Thus, the data access path is typed as [[db][has,age,@rel~,@int~]]. This data access path represents the logical, linear route to the data.

reference intro A reference instruction’s result must be equivalent to the result of applying the corresponding instruction to the original reference’s referent. This requirement is made salient in the diagram on the right, where applying [has,age,lt,30] to the people referent produces the same result as dereferencing the person[age:lt(30)]* reference. The two data access paths are diagrammed below.

reference intro 2
In diagrams, when an instance has a label that is of the form symbol, it denotes that the instance has the type symbol, not that the object is the type symbol.

References are related to other references via type instructions and the union of these relations is called the reference graph. The compile- and run-time analysis of mm-ADT reference graphs is the primary means by which model-ADT optimizations are realized. The various roles references serve in optimization are enumerated below.

  • Constraints: Types match instances. Reference type instructions can generate constrained subtypes which ultimately provide information to the processor about the nature of the referent and the structures supporting its encoding. This is how complex database schemas such as those containing integrity constrains, indices, sort orders, and the like are modeled.

  • Sub-streams: A stream refers to a sequence of objects. A stream that refers to a subset of a stream is a called a sub-stream reference. Database indices are modeled using sub-stream references.

  • Proxies: Storage systems provide a shared memory structure that allows references with data access paths to be detached, serialized, and re-attached as needed. In this way, references serve as serializable proxies to objects in distributed database environments.

  • Co-Locations: A reference’s value or data access path may contain information about the physical location of the referent. If so, references can migrate to their respective referent’s location and subsequently be processed by a local thread of execution. In this way, references provide data local processing via the co-location of structure and process in distributed database environments.

  • Functions: A reference can refer to the result of a computation executed outside the mm-ADT processor. Storage system aggregation operations are modeled using function references.

Processor and Storage Communication

The storage system and processor communicate with one another through mm-ADT objects. This is what it means for these two components to be mm-ADT compliant. When the processor reads from an mm-ADT object, it is reading data stored in the storage system. When the processor writes to an mm-ADT object, it is writing to the storage system’s physical representation of that object.

mm-ADT reference objects serve a more interesting purpose. In a processor, a reference is a unformed instance with potential access to the object value. In the storage system, a reference points to a potential operation such as, for example, a simple value lookup, a full-table scan, or an index query. The act of dereferencing, causes the storage system to enact the respective operation which manifests the mm-ADT object which is now the fully populated referent of the processor’s respective reference.

processor storage communication

At the semantic level, there is no difference between a reference and an instance. Both denote objects with bound values. It is possible (and advisable) that the storage system generate references as deemed appropriate according to its knowledge of its internal data structures. For example, if a stream contains only a few objects, then a scan and filter approach is typically more efficient than an index lookup. As such, at runtime, the storage system should provide a reference to the most appropriate computation.

[define,people,person*
  -> [has,age,@rel~,@int~] => person[age:$0($1)]*
[put,people => [db]]
[has,age,lt,30]               // scan and filter or index lookup?
processor storage communication 2

References do not dictate the means by which their data access paths are resolved. Instead, a processor simply moves from the tail of the data access path (the reference) to the head of the data access path (the referent) in "one step." It is this ambiguity around data access path resolution that allows storage systems to integrate auxiliary, secondary structures while remaining loosely coupled to the processor.

Database Object

db object Every storage system has a corresponding db object which can be accessed at anytime via the [db] instruction. The [db] instruction is an initial instruction in that it does not have an input stream. The db object that it emits to the output stream is the "root" of the storage system’s data structure. The typical mm-ADT type of the db object for common model-ADTs are presented below. While a db object can be a record, list, stream, etc. it is unique in that machine instructions can also be applied to it.

model-ADT db type description

relational

[(@string:@table)*]

Tables as streams of rows indexed by table name

property graph

vertex*

Vertices as a stream

document

[(@string:document*)*]

Collections as streams of documents indexed by collection name

key/value

[@key,@value]*

Key/value tuples as a stream

rdf

statement*

Statements as a stream

wide-column

[(@string:@table)*]

Tables as streams of variable width rows indexed by table name

Processes

Bytecode, also termed portable code or p-code, is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable source code, bytecodes are compact numeric codes, constants, and references (normally numeric addresses) that encode the result of compiler parsing and performing semantic analysis of things like type, scope, and nesting depths of program objects.

— Wikipedia Article on Bytecode
query bytecode processor

mm-ADT operations are instructions drawn from the mm-ADT instruction set. Instructions are organized into bytecode. Bytecode represents a step-by-step, procedural process that maps a stream of input objects to a stream of output objects. Bytecode is generated by a query language compiler and evaluated by a processing engine. The processing engine’s operations must be Turing Equivalent to the model-ADT’s instruction set (a subset of the mm-ADT instruction set). In general, any stream-based processing engine supporting map, filter, reduce, and branch operations can execute any mm-ADT bytecode query.

The operations of a model-ADT must be modeled with mm-ADT instructions. The model-ADT instruction set represents the embedded model-ADT operations.

Instructions

The mm-ADT instruction set is Turing Complete and can be used to generate mm-ADT bytecode that captures the semantics of any query written in any computable query language. To simplify query language compiler development, mm-ADT provides many high-level instructions that encapsulate common database query motifs such as pattern-matching, aggregation, selection/projection, and joins.

An mm-ADT instruction is a list containing a string opcode followed by zero or more object arguments.

inst [@string,@object*]

index

type

description

0

string

The instruction opcode

1

object

The instruction’s first argument

2

object

The instruction’s second argument

3

object

The instruction’s third argument

…​

…​

…​

n

object

The instruction’s nth argument

instruction structure Instruction arguments can be instances known at compile-time or references computed at runtime. This distinction is made salient below, where the two instructions are semantically equivalent, thought the first instruction uses a string instance and the second instruction uses an anonymously typed reference whose data access path yields the string 'marko' for every incoming object.

[has,name,eq,marko]
[has,name,eq,string => [start,marko]]

A more practical example is provided below. In the first instruction, the int age is set at compile-time and remains static for the full execution of the query. In the second example, the int is computed for every incoming object. The latter example demonstrates context-dependent arguments as, for every incoming person, the [has] instruction will only allow to pass those who are older than their oldest friend.

[has,age,gt,25]
[has,age,gt,int => [[get,friends][get,age][max]]]

The above reference’s type is an anonymous subtype of int. The data access path states that instances are in correspondence with the maximum ages of friends (a subset of int). If the reference’s type can be determined from the data access path, then the type object is optional. Thus, the previous two references can be written as below, where ! states that the list should be interpreted as runtime bytecode and not a list object.

[has,name,eq,![start,marko]]                    // N/A (initial) -> string
[has,age,gt, ![[get,friends][get,age][max]]]    // person        -> int

Assume the following bytecode. The [start] generates a reference to a person named marko. The [obj] creates a new record with two fields: name and popularity. The ![get,name] reference does not require dereferencing the previously created person reference as the person has a name value. However, the popularity value requires a dereference of the friends stream to ultimately yield the count.

[db]
[define,person,[name:@string~,age:@int,friends:@person*] => [[db][has,name,eq,$0]]]
[start,person[name:marko]]
[obj,[name      :![get,name],
      popularity:![[get,friends][count]]]]
The reference ![[get,friends][count]] appears to have an unbound value — there is no person object for [get,friends] to consume. However, in fact, there is a bound person object. The reference’s type is defined anonymously within the [obj] instruction and thus, the person object is bound, ultimately, to person[name:marko]. If the reference were to be isolated from its context, it would be a type as [get,friends] is not an initial instruction and thus, the data access path would be unresolvable.

Bytecode

bytecode structure An instruction denotes a single, atomic operation. Bytecode represents a series of such operations which, in aggregate, compute the results of a query. Structurally, bytecode is a list of instructions. As previously demonstrated, instructions can contain bytecode arguments. Therefore, bytecode is a nested, acyclic tree data structure. Unnested bytecode is called a query. Nested bytecode is called a subquery.

Bytecode queries are always written according to the data access paths of the models-ADT’s primary, logical structure. If a particular region of the primary structure has a supporting secondary structure (e.g. an index or a denormalization), then secondary data access paths are made available by the storage system via references. By remaining agnostic to optimizing structures, mm-ADT compliant query language compilers can be used with any mm-ADT compliant processor and storage system.

bc [@inst*] <=> [[@string,@object*]*]

index

type

description

0

inst

The first instruction

1

inst

The second instruction

2

inst

The third instruction

3

inst

The forth instruction

…​

…​

…​

n

inst

The (n-1)th instruction

The instruction [has,name,eq,![get,name]] contains a bytecode argument. This argument is actually a single instruction. The mm-ADT-bc compiler will automatically nest single instructions when the context is known to accept bytecode. Thus, ![get,name] is translated to ![[get,name]]. There are three contexts in which bytecode is known to be the accepted argument.

context example description

=>

=> [has,name,eq,39]

Instructions can only map to bytecode or types

@bc

[branch,[get,name],[get,age]]

Bytecode may be explicitly required in a pattern match

!

![get,name]

Bytecode computes object at runtime

!!

!![op,add,1]

Bytecode computes object at compile-time

Bytecode Source

Bytecode source is used in two situations: modeling and querying.

Modeling

Every mm-ADT compliant storage system must publish a model-ADT describing its primary, logical structure. The definition is an embedding into mm-ADT types. A relational-ADT is defined below, where the [model] instruction’s second argument denotes the embedding’s source and target ADTs. That is, rdb=>mm states that a model denoted rdb is being defined in terms of mm (mm-ADT). The subsequent [define] instructions specify rdb types and their representation in mm-ADT.

The storage-ADT as an mm-ADT embedding
// relational-ADT
[model,rdb=>mm,
  [define,row,[@string:@bool|@string|@number|@bytes]],
  [define,table,row*],
  [define,db,[(@string:@table)*]]]
bytecode source

A domain-specific social-ADT is defined below in terms of the previous relational-ADT. The social-ADT provides a proper, monomorphic sub-model of the relational-ADT. Thus, the embedding is a simple constraint. This is how database schemas are modeled.

A schema as a relational-ADT embedding
// social-ADT
[model,social=>rdb,
  [define,person,row[name:@string,age:@int,company:@string]],
  [define,organization,row[name:@string,founded:@int]],
  [define,people,table<@person*>],
  [define,organizations,table<@organization*>]]

The storage system’s model-ADT may not be a proper super-set of the user’s or query language’s model-ADT. In such situations, a more complex embedding needs to be defined. A key/value-ADT is embedded in the storage system’s relational-ADT below.

A key/value-ADT as a relational-ADT embedding
// key/value-ADT
[model,kv=>rdb,                                                              (1)
  [define,key,@number],                                                      (2)
  [define,value,@string],                                                    (3)
  [define,keyvalue,[@key~,@value~]                                           (4)
    => (row[key:$0,value:$1] => [[db][get,kv][has,key,eq,$0]])               (5)
    -> [put,0,@key]          => [error]                                      (6)
    -> [put,1,@value~]       => [put,value,$0]                               (7)
    -> [get,0]               => [get,key]                                    (8)
    -> [get,1]               => [get,value]],                                (9)
  [define,db,keyvalue* => [[db][get,kv]]                                     (10)
    -> [new] => [[db][put,kv,
                  (table<row[key:@key,value:@value]*> => [[db][get,kv]]      (11)
                    -> [has,key,eq,@key~]             => row[key:$0]?)]],    (12)
    -> [dedup,[get,0]]               => []                                   (13)
    -> [has,0,eq,@key~]              => [has,key,eq,$0]                      (14)
    -> [put,keyvalue[@key~,@value~]] => [coalesce,
                                          [[has,key,eq,$0][put,value,$1]],
                                          [put,[key:$0,value:$1]]]]]         (15)
1 A new model called kv is being embedded into the previously defined rdb model.
2 The kv model has a key type that is a number.
3 The kv model has a value type that is a string.
4 The kv model has a keyvalue type that is a subtype of list containing a key and a value.
5 Every keyvalue’s data access path is to a row in a rdb table called kv.
6 A keyvalue key is immutable.
7 Updating the keyvalue value updates the backing row’s value field.
8 Getting the keyvalue key gets the row’s key field’s value.
9 Getting the keyvalue value gets the row’s value field’s value.
10 The kv model’s db is a stream of keyvalue pairs with a data access path to an rdb kv table.
11 When the model is initialized, a new rdb table is created called kv with the respective key/value schema.
12 The rdb kv table has a unique index on key.
13 Removing duplicate keys from the keyvalue stream is a no-op as all keys are unique.
14 Getting a keyvalue by key triggers an index lookup on the kv table.
15 Adding a keyvalue performs an "update or create" on the kv table based on the prior existence of the key.
Querying

model-ADT definitions are static source files used to stage the mm-ADT storage and processing environment and for performing compile-time translations of respective bytecode submissions. The query below is oriented towards the key/value-ADT. However, given the previously defined embedding, it will be execute over the relational storage system.

A key/value-ADT query
[db][put,keyvalue[1,"a text to a friend"]]
    [put,keyvalue[2,"another text"]]
    [put,keyvalue[3,"i will keep sending texts"]]
[db][has,0,eq,1]
    [get,1]
The key/value-ADT query embedded in the relational-ADT
[db][get,kv][put,row[key:1,value:"a text to a friend"]]
            [put,row[key:2,value:"another text"]]
            [put,row[key:3,value:"i will keep sending texts"]]
[db][get,kv][has,key,eq,1]
            [get,value]
Translation of relational-ADT bytecode to SQL
INSERT INTO kv VALUES (1,"a text to a friend"),
                      (2,"another text"),
                      (3,"i will keep sending texts");
SELECT value FROM kv WHERE key=1;

The key/value data can be accessed by another model-ADT embedding as long as the other embedding uses the same storage system structures (i.e. the same tables). For instance, given this section’s example, SQL can also be used to read/write to the kv-table. If the SQL compiler generates mm-ADT bytecode, then it should not use the kv=>rdb embedding as it is already aligned with the storage-ADT.

Stream Ring Theory

mm-ADT processors evaluate bytecode. mm-ADT bytecode forms an algebraic ring. More specifically, bytecode satisfies the axioms and theorems of stream ring theory. This section presents the key findings of stream ring theory as it relates to bytecode optimization and execution. A more formal treatment is provided in the article below and should be referred to when designing mm-ADT compilers, optimizers, and processors.

Rodriguez, M.A., "Stream Ring Theory", S/V Red Herring’s Ship’s Log: Chronicles in the Sea of Cortez, pages 10—​40, Mulegé, Baja California Sur, México, doi:10.5281/zenodo.2565243, February 2019.

A Short Introduction to Ring Theory

A ring (with unity) is an algebraic structure defined as \$<<A,\+,\cdot>>\$, where \$A\$ is a set with two binary operators \$\+\$ and \$\cdot\$ called "addition" and "multiplication", respectively. The substructure \$<<A,\+>>\$ is a commutative group with additive inverses and an additive identity denoted \$0 \in A\$. If \$a,b,c \in A\$, then ring theory’s axioms are specified as

\${: ((a+b)+c = a+(b+c)), (0+a = a+0 = a), (a-a = a + (-a) = 0) :}\$

and

\$a+b = b+a.\$

The substructure \$<<A,\cdot>>\$ is a monoid with a multiplicative identity denoted \$1 \in A\$ and obeying the ring axioms

\$(a \cdot b) \cdot c = a \cdot (b \cdot c)\$

and

\$1 \cdot a = a \cdot 1 = a.\$

In the aggregate ring structure \$<<A,\+,\cdot>>\$, multiplication is both right and left distributive over addition such that

\$(a + b) \cdot c = (a \cdot c) + (b \cdot c)\$

and

\$a \cdot (b + c) = (a \cdot b) + (a \cdot c).\$

The following theorems can be deduced from the ring axioms (see Theorem 1).

\${: (a + b = a + c \implies b = c), (a + b = 0 \implies a = -b and b = -a), (-(a+b) = (-a) + (-b)), (-(-a) = a), (a \cdot 0 = 0 = 0 \cdot a), (a(-b) = -a(b) = -(ab)), ((-a)(-b) = ab) :}\$

Numerous mathematical and computational structures are rings. For example, in mathematics, numbers with addition and multiplication form a ring. In computing, mm-ADT bytecode forms a ring, where \$\+\$ is [branch] and \$\cdot\$ is concatenation. In the same manner in which algebraic expressions denoting numbers can be simplified using ring theory’s axioms and deduced theorems, so can mm-ADT bytecode. However, stream ring theory has a richer structure, with more axioms and subsequent theorems.

A stream ring is the product of two independent rings with unity known as a function ring and a coefficient ring.

The Function Ring

A function ring is a structure \$<<bb F,\+,\cdot>>\$, where \$bb F\$ is a set of stream functions (mm-ADT instructions). If \$a,b,c,d,e,f \in bb F\$, then the expressions \$a+b\$, \$a \cdot b \cdot c\$ and \$a \cdot b \cdot (c + (d \cdot e)) \cdot f\$ can be represented diagrammatically as process graphs, where the vertices are stream functions and the edges are streams.

ring graphs

The above function ring expressions can be denoted in mm-ADT bytecode using S-expressions, where the stream function elements are represented as instruction opcodes. In the three examples below, the only valid mm-ADT instruction is branch.

[branch,[a],[b]]                // a+b
[a][b][c]                       // a*b*c
[a][b][branch,[c],[[d][e]]][f]  // a*b*(c+(d*e))*f
Identities
add mult identities

Every ring with unity has a \$0\$ and \$1\$ element which are the additive (branch) and multiplicative (composition) identities, respectively. The filter instructions [#] and [_] represent \$0\$ and \$1\$. The ring axioms \$a+0 = a\$ and \$a \cdot 1=a\$ can be represented as mm-ADT bytecode instruction rewrites.

[branch,@inst*~a,[#]] => [$a]   // a+0=a
[       @inst*~a,[_]] => [$a]   // a*1=a
Distributivity
distributive

The right and left distributivity of multiplication over addition can be expressed in mm-ADT bytecode.

[branch,[@inst*~a,@inst+~c],
        [@inst*~b,@inst+~c]] => [[branch,[$a],[$b]][$c]]   // (a*c)+(b*c)=(a+b)*c
[branch,[@inst+~a,@inst*~b],
        [@inst+~a,@inst*~c]] => [[$a][branch,[$b],[$c]]]   // (a*b)+(a*c)=a*(b+c)
Binomials and Multinomials
multinomial

A binomial is any structure of the form \$(a+b)^n\$. A multinomial generalizes a binomial. Bi- and multinomials can be expanded in order to transform a multiplication of additions into an addition of multiplications. The multinomial below is presented with its expanded form. Expansions can be solved using the "foil method." From the perspective of streams, the concatenation of two 2-way branches can be expanded into an equivalent single 4-way branch.

\$(a+b)(b+c) = ab + ac + b^2 + bc\$
The Coefficient Ring

The structure \$<<bb C,\+,\cdot>>\$ is called a coefficient ring. The coefficient ring can be any ring with unity. Rings are prevalent and can be found in the integers and floats with numeric addition and multiplication, complex numbers with complex addition and multiplication, matrices with matrix addition and multiplication, etc. The model-ADTs of all common databases make use of an integer-based coefficient rings. However, mm-ADT is poised to open up new model-ADTs. For example, a quantum model-ADT uses complex matrix coefficients (see Section 5.H).

The universal functional commutativity of coefficients states that coefficients can be moved right or left and aggregated with multiplication without altering the semantics of the computation (see Theorem 6 and its respective corollaries).

The Stream Ring

mm-ADT bytecode is a stream ring. A stream ring is the product of the function ring and a chosen coefficient ring and is denoted \$<<bb C bb F, \+, \cdot >>\$, where

\$bb C bb F = bb C \times bb F = \{(c,a): c \in bb C \wedge a \in bb F\}.\$

In every tuple \$(c,a) \in bb C bb F\$, the first element is a coefficient and the second element is a function. The stream additive operator is defined as

\$ca + db = {((c+d)a,"if " a=b),(ca+db,"otherwise."):}\$

The stream multiplicative operator is defined as

\$ca \cdot db = (c \cdot d)(a \cdot b),\$

where the first \$\cdot\$ is the coefficient ring’s multiplicative operator and the second \$\cdot\$ is the function ring’s multiplicative concatenation operator.

Stream Ring Axioms

Stream ring theory defines seven axioms (see Section 3.C).

\${: (x ~ y = <<cx>> = <<cy>>,""), (<<cx,dy>> = <<dy,cx>>,""), (<<cx,dx>> = <<(c+d)x>>, "(bulk axiom)"), (<<cx>>da = <<(c \cdot d)a(x)>>,"(apply axiom)"), (<<cx>>(da + eb) = (<<cx>>da) + (<<cx>>eb), "(split axiom)"), (<<cx>> + <<dy>> = <<cx,dy>>, "(merge axiom)"), (<<0x>> = <<c\emptyset>> = << >>,"") :}\$
Equality

If two objects x and y are equivalent, then they can replace each other in a stream. Equality is defined as having the same behavior across all the model-ADT’s instructions.

Order

Streams are unordered and thus, two streams with the same two objects are equal streams.

Bulking

Two equal objects in the same stream can be combined into a single object, where the coefficients of the individual objects are summed together according to the coefficient ring’s additive operator.

Apply

When an object is applied to an instruction, the resulting object’s coefficient is equal to the object’s coefficient multiplied by the instruction’s coefficient as defined by the coefficient ring’s multiplicative operator.

Split/Merge

When an object encounters a branch in the stream, it is copied to each branch. When two streams are merged, the objects of each stream are put into a single stream.

Zero

If an object has a zero coefficient (as defined by the coefficient ring), it can be removed from the stream. If an object is an empty stream with a coefficient, it can be removed from the stream.

Streams are atemporal. There are no requirements to the order in which functions are applied, streams are merged, and objects are bulked (see Theorem 5). The consequence of this theorem is that process graphs can be evaluated in a lazy manner with depth-first semantics (save space), breadth-first semantics (save time), or any arbitrary hybrid of the two. Different query space/time requirements will demand different execution semantics, where real-time queries touching a small subset of the storage system’s data will typically use depth-first semantics while batch-time queries touching large subsets of the storage system’s data will typically use breadth-first semantics.

The mm-ADT Address Spaces

An mm-ADT database virtual machines operates in two address spaces: the storage system’s and the processing engine’s address spaces.

addresses

Storage System

The storage system provides a shared memory structure that all mm-ADT machine’s can access and mutate according to the storage system’s transaction semantics. Storage system addresses are mapped to mm-ADT data access paths. When an mm-ADT processor dereferences a data access path, it is up to the storage system to create an mm-ADT object that is populated by and linked to data that may be on another physical machine, serialized on disk, or directly accessible in memory.

Processing Engine

The processing engine supports two memory structures: a shared memory structure accessible via [db] and many shared-nothing memory structures called object threads. The db object contains type definitions and other machine metadata. It is primarily a read-only structure as writes require global synchronization. An object thread contains computational metadata such as the current stream object, previous thread states (i.e. path history), a loop stack, etc. Object threads represent isolated threads of execution flowing through the processor’s streams.

Instruction Set

instruction types The mm-ADT instruction set is a set of instruction types. For example, the instruction [get,@object] is a type as it contains unbound values. This instruction does not exist in the relational-ADT. Instead, the subtype [get,@string] is supported by the relational-ADT. When bytecode is generated by a query language compiler, instances of the supported types are created as in [get,name].

The mm-ADT instruction set is a set of instruction types. Instances of the model-ADT’s supported types are used to create bytecode to process the model-ADT’s structure.

There are 10 kinds of instructions.

  • Flatmap Instructions: transforms the input object to zero or more output objects. (one-to-many)

    • Map Instructions: transform the input object to one output object. (one-to-one)

    • Filter Instructions: transforms the input object to no object or the input object. (one-to-one or none)

  • Barrier Instructions: transforms zero or more input objects to zero or more output objects. (many-to-many)

    • Reduce Instructions: transforms zero or more input objects to a single output object. (many-to-one)

    • Initial Instructions: generate objects on the output stream. (none-to-many)

    • Terminal Instructions: consume objects on the input stream. (many-to-none)

  • Side-Effect Instructions: create, update, or delete an object.

    • Machine Instructions: interacts with the db object and other features of the mm-ADT machine.

  • Branch Instructions: clones the input object to zero more forks/splits/branches in the bytecode program structure.

The map, flatmap, and filter instructions form a function ring. The barrier and reduce functions form a near-ring (see Theorems 16-19). Finally, branch instructions are "structural" in that they are used to shape the process graph and, in of themselves, are not stream functions, but instead, they represent the additive stream operator.

Flatmap Instructions

opcode

description

domain

arguments

range

input

instruction

output

flatmap

Put the incoming object and the internal stream and drain the internal stream to the output stream.

@object

@bc

@object+

[marko,29]

[flatmap,[unfold]]

marko 29

unfold

Stream the incoming objects internal objects

@object

@object+

[marko,29]

[unfold]

marko 29

[marko,[29,42]]

[unfold]

marko [29,42]

42

[unfold]

42

[name:marko,age:29]

[unfold]

[name,marko] [age,29]

Map Instructions

The map ring \$<<bb F_m,\+,\cdot>>\$ contains the set of all map instructions. Map instructions transform one input object to one output object. If the map function \$a: X \rightarrow Y\$ is injective, then for every \$x \in X\$, if \$a(x_1) = a(x_2)\$, then \$x_1 = x_2\$. If the map function \$a\$ is surjective, then for all \$x \in X\$ every \$y \in Y\$ is mapped to. Finally, if \$a\$ is both injective and surjective, then \$a\$ is bijective and there exists a multiplicative inverse \$a^(-1) \in bb F_m\$ such that \$a \cdot a^(-1) = a^(-1) \cdot a = 1\$. Thus, the map instruction \$a\$ defines an isomorphism between the instances of the types \$X\$ and \$Y\$ and the following axioms and theorems for multiplicative inverses can be used in compilation.

\${: (a \cdot b = a \cdot c \implies b = c), (b \cdot a = c \cdot a \implies b = c), (a \cdot b = 1 \implies a = b^(-1) \wedge b = a^(-1)), ((a \cdot b)^(-1) = b^(-1) \cdot a^(-1)), ((a^(-1))^(-1) = a) :}\$

opcode

description

domain

arguments

range

input

instruction

output

map

Put the incoming object onto the internal stream and emit the first output to the output stream.

@object

@bc

@object

[name:marko,age:29]

[map,[[unfold][count]]]

2

[name:marko,age:29]

[map,[count]]

1

[3,2,1]

[map,[unfold]]

3

get

Get the object indexed by the specified key. For lists, the keys are integers. For records, the keys are field keys.

@list|@record

@object*

@object*

[name:marko,age:29]

[get,name]

marko

[name:marko,age:29]

[get,name,age]

marko 29

[42,6,92]

[get,0]

42

[creator:[name:marko,age:29]]

[get,creator]

[name:marko,age:29]

loops

The number of times the object thread has repeated the respective loop section

@object

@string?

@int

stephen

[loops]

3

kuppitz

[loops,i]

4

obj

Emit the argument object for each incoming object

@object

@object

@object

stephen

[obj,2]

2

[1,2,3]

[obj,[![get,0],42]]

[1,42]

path

The path object of the object thread. Bytecode is evaluated on the path objects in a round-robin manner.

@object

[@string*]?,[@bc*]?

@path

marko

[path]

[[db][name:marko][marko]]

marko

[path,[b]]

[b:[name:marko]]

marko

[path,[b,c]]

[b:[name:marko],c:marko]

marko

[path,[b,c],[[type]]]

[b:person,c:string]

subrec

Select a subset of the fields in the record and process the values in a round-robin manner

@record

[@object*],[@bc*]?

@record

[a:[1,1],b:2,c:3]

[subrec,[a,b],[[size]]]

[a:2,b:1]

[a:[1,1],b:2,c:3]

[subrec,[a,b],
  [[size],[type]]]

[a:2,b:int]

type

Get the declared type of the object

@object

@type

42

[type]

int

[42,6,92]

[type]

list

wallet[1,5,20,20]

[type]

wallet

[name:marko,age:29]

[type]

record

person

[type]

type

Filter Instructions

The filter ring \$<<bb F_f,\+,\cdot>>\$ contains the set of all filter instructions. A filter instruction is abstractly defined by a predicate \$p : X \rightarrow \{bb(true),bb(false)\}\$, such that if \$a \in bb F_f\$ and \$x\$ is a stream object, then

\$a(x) = { (x,"if " p(x)=bb(true)),(\emptyset,"otherwise.") :}\$

The filter ring is unique in that its multiplicative monoid \$<<bb F_f,\cdot>>\$ is both idempotent and commutative. The idempotent property means that if any filter instructions repeat within a sequence of filters, all but one can be removed without effecting the semantics of the computation. Moreover, the commutative property guarantees that a sequence of filter instructions can be arranged in any order. For instance, if \$a,b \in bb F_f\$, then

\${: (a \cdot a,=, a,"(idempotent)"), (a \cdot b,=,b \cdot a,"(commutative)".) :}\$

The commutative property allows filter instructions to be sorted by some features such as their cost (computational complexity) and/or their selectivity. In theory, an inexpensive, highly selective filter should be the first filter applied in a sequence. A bytecode optimization strategy based on filter ordering makes use of the filter ring’s total preorder (see Section 5.B).

opcode

description

domain

arguments

range

input

instruction

output

filter

Filter the incoming object is the argument has no object

@object

@bc

@object?

[42,6]

[filter,[[unfold][sum][is,eq,48]]]

[42,6]

[42,6]

[filter,[[unfold][sum][is,eq,49]]]

[name:josh]

[filter,[#]]

[name:josh]

[filter,[_]]

[name:josh]

#

Always filter the incoming object

@object

@object

[name:marko,age:29]

[#]

42

[#]

[1,2,marko]

[#]

_

Never filter the incoming object

@object

@object

[name:marko,age:29]

[_]

[name:marko,age:29]

42

[_]

42

[1,2,marko]

[_]

[1,2,marko]

and

Filter the incoming object if any of the arguments have no object

@object

@bc{2,}

@object

42

[and,[is,lt,43],[is,gt,5]]

42

[kuppitz,marko]

[and,[_],[[unfold][count][is,eq,2]]]

[kuppitz,marko]

[kuppitz,marko]

[and,[#],[[unfold][count][is,eq,2]]]

coin

Filter the incoming object based on a biased coin toss

@object

@floating

@object?

2 7 marko

[coin,0.5]

2 marko

2 7 marko

[coin,0.0]

2 7 marko

[coin,1.0]

2 7 marko

has

Filter the incoming container if it its index value doesn’t satisfy the predicate

@list|@record

@object,@pred,@object

(@list|@record)?

[name:marko,age:29]

[has,name,regex,m.*]

[name:marko,age:29]

[name:josh]

[has,age,gt,25]

[1,2,marko]

[has,2,eq,marko]

[1,2,marko]

hasKey

Filter the incoming container if it doesn’t have a key that satisfies the predicate

@list|@record

@object,@pred,@object

(@list|@record)?

[name:marko,age:29]

[hasKey,regex,na.*]

[name:marko,age:29]

[name:josh]

[hasKey,eq,age]

[1,2,marko]

[hasKey,gt,1]

[1,2,marko]

hasValue

Filter the incoming container if it doesn’t have a value that satisfies the predicate

@list|@record

@object,@pred,@object

(@list|@record)?

[name:marko,age:29]

[hasValue,regex,mar.*]

[name:marko,age:29]

[name:josh]

[hasValue,eq,42]

[1,2,marko]

[hasValue,gt,1]

[1,2,marko]

is

Filter the incoming object if it doesn’t satisfy the predicate

@object

@pred,@object

@object?

true

[is,eq,true]

true

[name:marko]

[is,eq,[name:marko]]

[name:marko]

[kuppitz,gremlin]

[is,typeof,float]

[kuppitz,gremlin]

[is,typeof,[object{2}]]

[kuppitz,gremlin]

[name:kuppitz,
 stuff:
  [1,[2,3],no]]

[is,typeof,
 [name:@string,stuff:[@int,[@int{2}],yes|no]]

[name:kuppitz,
 stuff:
  [1,[2,3],no]]

or

Filter the incoming object if all of the arguments have no object

@object

@bc{2,}

@object

42

[or,[is,lt,43],[is,gt,5]]

42

[kuppitz,marko]

[or,[_],[[unfold][count][is,eq,3]]]

[kuppitz,marko]

[kuppitz,marko]

[or,[#],[[unfold][count][is,eq,2]]]

[kuppitz,marko]

Barrier Instructions

opcode

description

domain

arguments

range

input

instruction

output

barrier

Aggregate the entire input stream, process it, and yield an output stream

@object*

@bc?

@object*

42 6 92

[barrier]

42 6 92

[name:marko]
[name:josh]
[name:stephen]

[barrier,[get,name]]

marko josh stephen

dedup

Deduplicate the input stream based on some feature of the objects

@object*

@bc*

@object*

42 6 6 92

[dedup]

42 6 92

[name:marko]
[name:marko]
[name:marko,age:29]
[name:josh,age:29]

[dedup]

[name:marko]
[name:marko,age:29]
[name:josh,age:29]

[name:marko]
[name:marko]
[name:marko,age:29]
[name:josh,age:29]

[dedup,[get,name]]

[name:marko]
[name:josh,age:29]

[42,a,1][42,a,2][6,a,2]

[dedup,[get,0],[get,1]]

[42,a,1][6,a,2]

[42,a,1][42,a,2][6,a,2]

[dedup,[get,1],[get,2]]

[42,a,1][42,a,2]

order

Order the input stream based on some feature of the objects and relational order. Ascending is gt and descending is lt.

@object*

[lt|gt,@bc?]+

@object*

42 6 6 92

[order,[lt]]

92 42 6 6

[name:vadas,age:27]
[name:josh,age:32]
[name:marko,age:29]

[order,[gt,[get,name]]]

[name:vadas,age:27]
[name:marko,age:29]
[name:josh,age:32]

[name:vadas,age:29]
[name:josh,age:32]
[name:marko,age:29]

[order,
 [lt,[get,age]],
 [gt,[get,name]]]

[name:josh,age:32]
[name:marko,age:29]
[name:vadas,age:29]

range

Filter out all objects in the input stream not within the numeric range

@object*

@integer{2}

@object*

42 6 92

[range,0,2]

42 6

42 6 92

[range,2,3]

92

42 6 92

[range,2,100]

92

sample

Filter a specified number of input objects biased by some feature of the objects

@object*

@integer,@floating?

@object*

tail

Filter those objects not at the end of the input stream

@object*

@integer

@object*

Reduce Instructions

The reduce near-ring \$<<bb F_r,\+,\cdot>>\$ contains the set of all reduce instructions. Multiplication is not right distributive over addition. However, multiplication is left distributive over addition. Reduce instructions are temporal in that all previous instructions must have produced all their outgoing objects before a reduce instruction can process its input stream. Reduce instruction introduce synchronization barriers into the stream.

There exists a subset of reduce functions called the monoidic reduce functions. These functions are idempotent and semi-right distributive (see Section 5.D).

opcode

description

domain

arguments

range

input

instruction

output

reduce

Reduce the input stream according to the specified object

@object*

@object

@object

count

Count the number of objects in the input stream

@object*

@long

42 6 92

[count]

3

[42,6,92]

[count]

1

[count]

0

fold

Put the mutating argument object onto the nested bytecode’s input stream for each incoming object

@object*

@object,@bc

@object

1 2 3

[fold,0,[op,add,![_]]]

6

marko 42 kuppitz

[fold,[:],[put,![_],![_]]]

[marko:marko,42:42,kuppitz:kuppitz]

marko 42 kuppitz

[fold,[],[fit,![_]]]

[marko,42,kuppitz]

group

Group the input stream objects by a key and then reduce the values

@object*

@bc{3}

@record

[42,1][42,5][6,2]

[group,
  [get,0],
  [get,1],
  [sum]]

[42:6,6:2]

[42,1][42,5][6,2]

[group,
  [get,0],
  [obj,1],
  [sum]]

[42:2,6:1]

42 42 6

[group,
  [_],
  [obj,1],
  [sum]]

[42:2,6:1]

marko kuppitz josh

[group,
  [obj,name],
  [_],
  [fold,[]]]

[name:[marko,kuppitz,josh]]

max

Reduce the number input stream to the largest number

@number*

@number

42 6 92

[max]

92

0.1 55 109.2

[max]

109.2

mean

Reduce the number input stream to the mean

@number*

@number

42 6 92

[mean]

46

0.1 55 109.2

[mean]

91.1

median

Reduce the number input stream to the median

@number*

@number

42 6 92

[median]

42

0.1 55 109.2

[median]

55

min

Reduce the number input stream to the smallest number

@number*

@number

42 6 92

[min]

6

0.1 55 109.2

[min]

0.1

sum

Reduce the number input stream to the sum

@number*

@number

42 6 92

[min]

140

0.1 55 109.2

[min]

164.3

Initial Instructions

opcode

description

domain

arguments

range

input

instruction

output

db

Emits the current model-ADT’s storage system

@object

[db]

d[persons:people=>[[db][get,persons]]]b

start

Emits the argument objects

@object*

@object*

[start,1,2,3]

1 2 3

[start,[1,2,3]]

[1,2,3]

Terminal Instructions

opcode

description

domain

arguments

range

input

instruction

output

drop

Drop the incoming object at its data access path

@object

[name:marko]

[drop]

Side-Effect Instructions

opcode

description

domain

arguments

range

input

instruction

output

sideeffect

Evaluate the bytecode argument and emit the incoming object

@object

@bc

@object

[marko,stephen]

[sideeffect,[[db][drop]]]

[marko,stephen]

as

Label the current object in object thread’s path history

@object

@string

@object

kuppitz

[as,a]

kuppitz

drop

Remove the object from the structured sequence by its associated key

@list|@record

@object

@list|@record

[marko,42,29]

[drop,1]

[marko,29]

[name:marko,age:29]

[drop,name]

[age:29]

fit

Insert the object into the list at the specified index location

@list

@int?,@object

@list

[marko,29]

[fit,1,42]

[marko,42,29]

[marko,29]

[fit,42]

[marko,29,42]

put

Add or replace the object with the provided key/value

@list|@record

@object{2}

@list|@record

[marko,29]

[put,0,kuppitz]

[kuppitz,29]

[name:marko,age:29]

[put,name,kuppitz]

[name:kuppitz,age:29]

[name:marko]

[put,age,29]

[name:marko,age:29]

Machine Instructions

opcode

description

domain

arguments

range

input

instruction

output

cost

The cost of executing the instruction on the type, instance, or reference object

@object

@inst

@cost

migrate

Migrate the incoming object to the db object at the respective data access path

@object

send,recv

@object?

vertex[id:1]

[migrate,send]

vertex[id:1]

[migrate,recv]

vertex[id:1]

define

Associate an object (typically a type) with a symbol

@db

@string,@object

@db

d[]b

[define,age,int<gt(0)>]

d[]b

deref

Force the dereference of the incoming object

@object

@object

vertex[id:1]

[deref]

vertex[id:1,b:2,c:3]

explain

Get a description of the bytecode compilation

@string

[explain]

[db][get,V]=>[db][get,1]

model

Define an embedding from a source-ADT to a target-ADT

@db

@string,[define,@string,@object]+

@db

d[]b

[model,pg⇒mm,[define,db,vertex*]]

d[]b

ref

Create a reference to the incoming object

@object

@type?

@object

vertex[id:1,b:2,c:3]

[ref]

vertex[id:1]

vertex[id:1,b:2,c:3]

[ref,vertex[id:@int,b:@int]]

vertex[id:1,b:2]

vertex[id:1,b:2,c:3]

[ref,vertex]

vertex[id:1]

Branch Instructions

opcode

description

domain

arguments

range

input

instruction

output

branch

Copy the incoming stream to each branch stream and then merge to a single outgoing stream

@object*

@bc{2,}

@object*

1 2

[branch,[_],[_]]

1 2 1 2

1 2

[branch,[_],[is,gt,1]]

1 2 2

at

Access the object of the first bytecode argument and then pass that object to the second bytecode argument, where all nested bytecode arguments have the incoming object as their source

@object*

@bc{2}

@object*

[1,[2,[3],4],5]

[at,[[get,1][get,1]],
  [fit,3.5]]

[1,[2,[3,3.5],4,5]]

choose

Map the incoming object using the first bytecode and propagate it down the bytecode patterns it matches

@object

@bc,[@bc,@bc]+

@object*

[name:marko,age:29]

[choose,[get,age],
  [[is,lt,30], [obj,young]],
  [[is,gte,30],[obj,old]]]

young

coalesce

Propagate the incoming object down the first bytecode argument that has an outgoing object

@object

@bc{2,}

@object*

ifelse

If the incoming object satisfies the first argument, then pass to the second, else to the third

@object*

@bc{3}

@object*

1 2 3

[ifelse,[is,gt,1],
  [op,sub,1],
  [op,add,1]]

2 1 2

repeat

The r-bytecode is the loop. The e-bytecode is the emit predicate. The b-bytecode is the break predicate.

@object*

[r|e|b,@bc]{1,3}

@object*

1

[repeat,
  [r,[op,add,1]],
  [e,[is,gt,7]],
  [b,[is,eq,10]]]

8 9 10

1

[repeat,
  [r,[op,add,1]],
  [b,[is,eq,10]]]

10

1

[repeat,
  [r,[op,add,1]],
  [b,[[loops][is,eq,9]]]]

10

1

[repeat,
  [e,[_]],
  [r,[op,add,1]],
  [b,[is,eq,3]]]

1,2,3,3

Secondary Structures

Query optimization is a function of many database management systems. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans. Generally, the query optimizer cannot be accessed directly by users: once queries are submitted to the database server, and parsed by the parser, they are then passed to the query optimizer where optimization occurs. However, some database engines allow guiding the query optimizer with hints.

— Wikipedia Article on Query Optimization

primary secondary A model-ADT’s primary structure is its logical model. Bytecode is always written in terms of the objects that compose the logical model. Thus, from a processor’s perspective, it appears that it is always constrained to the data access paths offered by the linkages between the objects of the primary structure. However, unbeknownst to the processor, certain objects called references behave identically to their referent counterparts in every way save that when certain instructions are applied to them, auxiliary data access paths are exposed that exist outside the primary structure and amongst any number of secondary structures. Secondary structures allow the processor to "teleport" to other regions of the primary structure — exiting and entering as dictated by the references encountered along the way.

Secondary structures optimize common, idiomatic data access patterns. Example secondary structures include: schemas, indices, denormalizations, views, functions, and proxies. This section will present the way in which common secondary structures are incorporated in mm-ADT.

References Revisited

abstract reference

A processor traverses a model-ADT’s object graph by executing instructions. When certain instructions are applied to certain objects, a reference object is yielded. In the example above, when [f] is applied to the atype instance, the processor moves to a btype* stream reference. If the processor were to dereference the btype* reference it would be directed to the head of the btype* stream. However, if the processor does not dereference btype* and if the processor’s next instruction is [g], then a new reference is accessed that, when dereferenced, yields a single ctype instance. Instead of processing the entire btype* stream with [g], the reference graph (the graph of instruction linkages between references) enables the processor to jump to the result of [g] avoiding the manifestation of objects in the btype* stream. All secondary structures are modeled in this fashion.

Primary Structure Paths

Assume a relational-ADT storage system with the following table instance.

CREATE TABLE persons (
  name varchar(255) NOT NULL,
  age  int          NOT NULL)

The SQL CREATE TABLE statement is expressed in mm-ADT-bc as below.

[db]
[define,person,row[name:@string,age:@int]]              // a person value type
[define,people,table<@person*>]]                        // a people value type
[put,persons,people => [[db][get,persons]]]             // an empty people stream

Suppose the persons-table is populated with a significant number of person objects (on the order of thousands).

[db][get,persons]
    [put,person[name:a,age:1]]
    [put,person[name:b,age:2]]
         ...
    [put,person[name:z,age:26]]

Next, suppose the following query is evaluated by a processor. The table below provides the output for each instruction.

[db]               // [persons:people => [[db][get,persons]]]
[get,persons]      // people => [[db][get,persons]]
                   // person[name:a,age:1]  ... person[name:z,age:26] (iteration required)
[has,age,gt,21]    // person[name:v,age:22] ... person[name:z,age:26]
[get,name]         // v w x y z
[dedup]            // v w x y z
[count]            // 5L

The instruction [get,persons] moves the processor to the head of the people stream. When [has,age,gt,21] is evaluated, people is iterated, manifesting the individual person objects of the people stream. These person objects are individually processed by [has,age,gt,21]. Those persons not filtered have their name-value accessed by [get,name]. The string names are then deduplicated via the barrier instruction [dedup]. The outputted unique string names are then reduced to a final long value by [count]. The processor’s path through the data required the manifestation of every person object and the subsequent analysis of a subset of those objects. The processor’s path was completely contained within the primary structure of the relational-ADT.

primary only
Secondary Structure Paths

Secondary structures can be defined and leveraged during compile- and run-time processing. The previous example is extended using a richer table definition.

CREATE TABLE persons (
  name varchar(255) NOT NULL UNIQUE,
  age  int          NOT NULL)

CREATE INDEX persons_age_idx ON persons(age)

The table above can also be created in mm-ADT-bc as shown below. There is no way to explicitly create an index with mm-ADT. Instead, an index is created implicitly as the [has,age,@rel,@int] instruction mapping tells the storage system it would like a direct reference from a people stream to persons of a certain age. The storage system can manifest that indirection using any appropriate secondary structure such as an index. Next, [dedup,[get,name]] maps to a no-op. This instruction mapping tells the storage system that no two person objects have the same name. To guarantee that the people instance’s semantics are upheld, the storage system should apply a uniqueness constraint.

[define,person,row[name:@string,age:@int]]
[define,people,table<@person*>
                 -> [has,age,@rel,@int] => person[age:$0($1)]*
                 -> [dedup,[get,name]]  => []]
[put,persons,people => [[db][get,persons]]]

The output of each instruction from the previous query is provided below.

[db]               // [persons:people => [[db][get,persons]]]
[get,persons]      // people                   => [[db][get,persons]]
                   //   -> [has,age,@rel,@int] => person[age:$0($1)]*
                   //   -> [dedup,[get,name]]  => []
[has,age,gt,21]    // person[age:gt(21)]*      => [[db][get,persons][has,age,gt,21]]
                   //   -> [get,name]          => string*
                   //   -> [dedup,[get,name]]  => []
[get,name]         // string*      => [[db][get,persons][has,age,gt,21][get,name]]
                   //   -> [dedup] => []
                   //   -> [count] => long
[dedup]            // string*      => [[db][get,persons][has,age,gt,21][get,name]]
                   //   -> [dedup] => []
                   //   -> [count] => long
[count]            // long => [[db][get,persons][has,age,gt,21][get,name][count]]

The storage system has access to secondary structures and is able to expose a data access path from the people stream to those persons of a certain age. In other words, the storage system has direct access to "young people" (via the persons_age_idx index) and thus, [has,age,gt,21] does not require a linear scan of the people stream, which would materialize person instances. The [get,name] instruction requires the names of these young people. The "young people" reference provides an instruction match as the index contains name data (presumably). While the [dedup] instruction requires names, the schema states that all names are unique (e.g. via the UNIQUE keyword). Thus, [dedup] is a no-op and can be safely ignored. Finally, given that names are unique, the number of unique names amongst young people is equivalent to the number of young people. The original [has,age,gt,21]-based index query’s count is provided by the storage system via a long reference. In this way, the processor only manifests a single long value: 5L. Unlike the previous example, the processor is able to jump from people to 5L — subverting the primary structure’s data access paths by way of references which expose secondary index and schema structures.

secondary only
The Reference Graph

The set of all model-ADT references and their relationship to one another via type instructions forms a directed, binary, multi-relational reference graph. The diagram below extends the example from the previous section.

reference graph
  • Vertices are references.

    • Vertex labels denote the reference type.

  • Edges are instructions.

    • Edge labels denote the instruction.

    • The tail vertex is the instruction’s source object.

    • The head vertex is the instruction’s target object.

Every reference graph commutes.

Regardless of the path taken, the destination reference will contain the same referent. The commutativity of reference graphs enables bytecode optimizers to rewrite certain bytecode sequences into semantically equivalent (i.e. yielding the same referents), though computationally more efficient representations. For instructions with bytecode arguments, runtime instruction matching provides processor redirections.

The better secondary structures are at keeping the processor in the reference graph avoiding dereferencing and thus, referent object materialization, the more optimal the model-ADT embedding.

Common Database Features

Database systems share many features in common. These features are captured by mm-ADT in a ADT-agnostic manner. The subsections to follow identify a particular database feature and explain how to represent that feature in mm-ADT. All the examples to follow assume a relational-ADT and SQL.

Schemas

A database system is designed to effectively encode and process a particular model-ADT. Example model-ADTs include a relational-ADT, a graph-ADT, a document-ADT, etc. If the database supports user-defined schemas, then users are able to create domain-specific model-ADTs that are ultimately a monomorphic embedding into the database’s model-ADT. This section will go through various CREATE TABLE examples for creating a domain-specific model-ADT over a relational-ADT in SQL. Each example will discuss the corresponding mm-ADT-bc representation.

Record Types

The SQL statement below creates a persons-table where every person row has an optional name and age field. This is represented in mm-ADT-bc using the zero-or-one ? quantifier on each field.

CREATE TABLE persons (
  name varchar(255),
  age  int,
)
[db][put,persons,[name:@string?,age:@int?]* => [[db][get,persons]]]
Primary Keys

A primary key is a collection-specific, uniquely identifying field (or set of fields). This means that if the collection is processed by a [dedup] based on the primary key field, a no-op should occur.

CREATE TABLE persons (
  name varchar(255),
  age  int,
  PRIMARY KEY (name)
)
[db][put,persons,[name:@string,age:@int?]* => [[db][get,persons]]
                   -> [dedup,[get,name]]   => []]
Sort Orders

The rows of a table can be ordered such that upon access, the resultant stream returns its objects in sort order. This is specified in mm-ADT-bc as an [order] rewrite to [] on the particular key/order defined.

CREATE TABLE persons (
  name varchar(255),
  age  int,
  PRIMARY KEY (name,ASC)
)
[db][put,persons,[name:@string,age:@int?]*    => [[db][get,persons]]
                   -> [dedup,[get,name]]      => []
                   -> [order,[gt,[get,name]]] => []]
Null Values

Unless otherwise stated, a row value can be null. If a field should not contain a null value, then NOT NULL can be specified which amounts to a {1} quantifier in mm-ADT-bc.

CREATE TABLE persons (
  name varchar(255),
  age  int NOT NULL,
  PRIMARY KEY (name,ASC),
)
[db][put,persons,[name:@string,age:@int]*     => [[db][get,persons]]
                   -> [dedup,[get,name]]      => []
                   -> [order,[gt,[get,name]]] => []]
Check Constraints

A row value can be constrained to always satisfy a particular predicate. For instance, in the person domain model, ages should never be negative. This is specified in SQL using CHECK and in mm-ADT-bc using an anonymous int type.

CREATE TABLE persons (
  name varchar(255),
  age  int NOT NULL,
  PRIMARY KEY (name,ASC),
  CHECK (age >= 0)
)
[db][put,persons,[name:@string,age:int<gte(0)>]* => [[db][get,persons]]
                   -> [dedup,[get,name]]         => []
                   -> [order,[gt,[get,name]]]    => []]
Foreign Keys

A field value can refer to another row in a table. The value must be uniquely identifying such as the referent table’s primary key. Foreign keys provide a way to link "objects" which is expressed in mm-ADT-bc using a references.

CREATE TABLE persons (
  name    varchar(255),
  age     int NOT NULL,
  manager varchar(255) NOT NULL,
  PRIMARY KEY (name,ASC),
  CHECK (age >= 0)
  FOREIGN KEY (manager) REFERENCES person(name)
)
[db][put,persons,[name:@string,
                  age:int<gte(0)>,
                  manager:([name:@string~]    => [[db][get,persons][has,name,eq,$0]])]*
                                              => [[db][get,persons]]
                   -> [dedup,[get,name]]      => []
                   -> [order,[gt,[get,name]]] => []]

When specified in mm-ADT-bc, only those instruction rewrites that can not be deduced from the type object are denoted. An implementation may add instruction rewrites founded on type information. The following type definition includes all such instructions.

[db][put,persons,[name:@string,
                  age:int<gte(0)>,
                  manager:([name:@string~] => [[db][get,persons][has,name,eq,$0]])]*
                                                        => [[db][get,persons]]
                   -> [dedup,[get,name]]                => []
                   -> [order,[gt,[get,name]]]           => []
                   -> [hasKey,type,string]              => []
                   -> [hasKey,eq,name|age|manager]      => []
                   -> [has,name,type,string]            => []
                   -> [has,age,type,int<gte(0)>]        => []
                   -> [has,manager,type,[name:@string]] => []
                   -> [has,age,lte|eq,int<lt(0)>]       => [#]]
Indices

A database index is a secondary structure that provides direct access to subsets of the primary structure. Typically, a database index is able to transform linear O(n) "scan and filter" set reductions into sublinear O(1) or O(log n) set reductions via a hash-index or B-tree index, respectively. In mm-ADT, index-based data access paths are modeled with references.

An index is denoted by a filter-instruction that returns a reference to a subset of the current reference’s referents.
Single Key Indices

The following SQL statement creates a table with respective index on name. The index is defined as being UNIQUE and therefore, no two rows in the table can share the same name value.

CREATE TABLE persons (
  name varchar(255),
  age int
)

CREATE UNIQUE INDEX persons_name_idx ON persons (name)

The above SQL construction is expressed in mm-ADT-bc as follows.

[db]
[define,person,[name:@string, age:@int]]
[define,people,person*
                 -> [has,name,eq,@string~] => person[name:$0]?]
[put,persons,people => [[db][get,persons]]]

There is a data access path that subverts the need to dereference people. Furthermore, because all person names are unique, the resultant @person[name:$0]? stream is guaranteed to have 0 or 1 persons. This is expressed with the type’s ?-quantifier. It is through such mechanisms that a storage system exposes its indices. In particular, this is how persons_name_idx is exposed.

If the storage system uses a relational-ADT, then the following SQL query leverages the same secondary structure as a person[name:$0]? reference.

SELECT age FROM persons WHERE name='marko'
single key index

The SQL query is represented in mm-ADT-bc below.

[db][get,persons]
    [has,name,eq,marko]
    [get,age]
An index query is a subtype reference of a reference. An index lookup is the process of dereferencing an index query. The query result is the dereferenced instance.
Multi-Key Indices

A multi-key index (composite index) is modeled with multiple [has] instructions chained together in a reference path.

CREATE TABLE persons (
  name varchar(255),
  age int
)

CREATE INDEX persons_name_age_idx ON persons (name, age)

The 2-key multi-index is defined in mm-ADT-bc below.

[db]
[define,person,[name:@string, age:@int]]
[define,people,person*
                 -> [has,name,eq,@string~]   => (person[name:$0]*
                   -> [has,age,@pred~,@int~] => person[name:$_0,age:$0($1)]*)
                 -> [has,age,@pred~,@int~]   => (person[age:$0($1)]*
                   -> [has,name,eq,@string~] => person[name:$0,age:$_0($_1)]*)]
[put,persons,people => [[db][get,persons]]]
multi key index

If the processor’s current instruction is [has,name,eq,marko], then a new person[name:marko]* reference is created containing a single [has] instruction on age. However, if the current instruction is [has,age,gt,25], then a person[age:gt(25)]* is created with a single [has] instruction on name. The two reference paths ultimately converge on a person[name:marko,age:gt(25)]* reference

The SQL query below uses the multi-key persons_name_age_idx index.

SELECT age FROM persons WHERE name='marko' AND age > 25

The SQL query has the following mm-ADT-bc representation.

[db][get,persons]
    [has,name,eq,marko]
    [has,age,gt,25]
    [get,age]
Denormalizations

Best practices recommend having one and only one physical representation of a logical object. Data structures adhering to this recommendation are considered normalized. However, there are real-world situations that are best solved with denormalized data. Denormalization creates redundant copies of objects in order to co-locate objects with repeatedly co-accessed data. This tends to increase read speeds but decrease write speeds. Denormalization requires an infrastructure to propagate any object updates to all its redundant copies. If the storage system does not provide support for denormalization, then the query language and processing engine must perform the required read/write operations to take advantage of duplicated objects and ensure the integrity of each object clone.

Denormalized Reads

denorm 1 If the storage system does not provide denormalization support, then queries directed at the normalized representation of the primary structure must be rewritten to make use of the denormalized form. However, in mm-ADT, instruction rewrites allow a data access path for normalized data to be rewritten as a data access path to denormalized data. In this way, a secondary structure is encoded in the logical, primary structure and mm-ADT queries can continue to respect the semantics of the normalized encoding while benefiting from the performance benefits of the denormalized form. As an example, suppose there exists two relational tables: persons and projects.

[db]
[define,project,[title:@string,lang:@string]]
[define,person,[name:@string,project:@string]]
[put,persons,person*   => [[db][get,persons]]]
[put,projects,project* => [[db][get,projects]]]
[db][get,persons]
    [put,person[name:marko,project:lop]]
    [put,person[name:vadas,project:lop]]
    [put,person[name:josh,project:ripple]]
    [put,person[name:peter,project:lop]]
[db][get,projects]
    [put,project[title:lop,lang:java]]
    [put,project[title:ripple,lang:java]]

The SQL query below joins the two tables according to the WHERE predicate and then SELECTs the lang value of each person’s project.

SELECT lang FROM persons INNER JOIN projects ON project=title

The mm-ADT bytecode below produces the same result. The following diagram shows the transformation of the data structures with each applied instruction.

[db][get,persons]
    [at,[[db][get,projects]],
         [has,title,eq,![get,project]]]
    [get,lang]
denorm 2

If the lang of the person’s project is always accessed when the people-table is accessed, then future queries will perform better if the projects-table’s lang value is copied into the people-table.

[db][define,person[name:@string,project:@string,(lang:@string)?]]
    [get,projects]
    [at,[[db][get,persons]],
         [has,project,eq,![get,title]]
         [put,lang,![get,lang]]]

In order to leverage the denormalization optimization, the previous SQL query would need to be rewritten to access the co-located lang in the persons-table. Now only a single table access is required and no INNER JOIN is necessary to yield the same result that was previously derived from the normalized encoding.

SELECT lang FROM persons

mm-ADT bytecode is written in terms of the mm-ADT’s logical model. Thus, the denormalization optimization must be able to be leveraged without changing the bytecode. This is accomplished by adding two instruction patterns to the people-table reference.

[db][put,persons,![[get,persons]
                   [ref,person* => [[db][get,persons]]
                     -> [at,[[db][get,projects]],
                            [has,title,eq,![get,project]]] => []]]]
denorm 3

When an [at] instruction combining the people-table and projects-table is evaluated, the above instruction pattern is matched. The result is a no-op. Thus, the subsequent [get,lang] is applied to the persons-table, not a recomputed [at] of the persons- and projects-tables. The instruction pattern directs the processor to the memoized (materialized) representation of the [at]. Given that [at] is a no-op, it can be removed from the bytecode.

[db][get,persons]
    //[at,[[db][get,projects]],
    //     [has,title,eq,![get,project]]]
    [get,lang]

Without [at], the bytecode is represented as below.

[db][get,persons]
    [get,lang]

This representation is semantically equivalent to the denormalized SQL query.

SELECT lang FROM persons
Stream-Based Joins

The [at] instruction is a branch instruction. It consumes two input streams and can reason on them in parallel via an internal nested stream. The [at] instruction is different from all other mm-ADT instructions in that the internal stream’s incoming object is not the same object as the incoming object to bytecode arguments.

[db]
[get,persons]                       (1)
[at,[[db][get,projects]],           (2)
    [has,title,eq,![get,project]]]  (3) (4)
1 A stream of person objects.
2 A stream of project objects.
3 The internal nested stream’s input are projects.
4 The internal nested stream’s arguments' are persons.

The [has] instruction can be seen as a join predicate. It determines if the incoming project’s title is equal to the incoming person’s project. Any number of filter instructions can be composed to yield arbitrarily complex predicates. If a structural join is desired, then, once the two objects have survived the gauntlet of predicates, they can be merged into a single record via [into].

[db]
[get,persons]                      // [name:marko,project:lop]
[at,[[db][get,projects]],          // [title:lop,lang:java]
    [[has,title,eq,![get,project]] // [title:lop,lang:java][name:marko,project:lop]        (1)
     [has,lang,neq,![get,name]]    // [title:lop,lang:java][name:marko,project:lop]        (2)
     [into,a,b]]]                  // [a:[title:lop,lang:java],b:[name:marko,project:lop]] (3)
1 A join-predicate ensuring the person’s project is equal to the project’s title.
2 A nonsense join-predicate demonstrating more filtering.
3 The two parallel objects are put into a record with an a and b field for the project and person, respectively.

The [at] instruction implements a nested for-loop. Each incoming person is tested against every project. This sort of join operation is called a nested loop join. However, because the [has] instructions are being applied to the projects stream reference and because indices are defined (see section specifics), then a full iterative scan of the projects is not executed for each incoming person. The projects stream is reduced to those projects with the exact title of the person’s project. This join operation is called an index nested loop join.

If the two streams were defined with a sort order on the respective fields being analyzed by the join-predicates (via a no-op on an [order] instruction), then when the two streams are certain not to meet the predicate conditions, the internal stream can be halted. A join operation of this nature is called a sort-merge join.

The decision as to which join algorithm to use is not made explicit in mm-ADT bytecode. Instead, reference objects play the role of determining, which paths the processor should take through the primary and secondary structures of the storage system.

Denormalized Writes

When an object is replicated in different areas of the storage system, a mutation to one of those copies requires the same mutation to all the other copies. Object clone synchronization is the reason denormalization is expensive on writes. Synchronization is implemented in mm-ADT using instruction rewrites that "broadcast" the respective update to all copies of the object. If the project lang value is considered the canonical representation, then whenever the lang value is changed on a project, all respective person lang values need to be updated as well. A [branch] can be used to traverse the data access paths of all redundant objects. This technique is analogous to an on-update trigger.

denorm 4
[db]
[define,project[title:@string~]
  -> [put,lang,@string~] => [[branch,[_],[[db][get,persons][has,project,eq,$_0]]]
                             [put,lang,$0]]]
[define,person
  -> [put,lang,@string]  => [error]  // only the canonical source can be updated
  -> [drop,lang]         => [error]]

[db][get,projects]
    [has,title,eq,ripple]
    [put,lang,rdf]
Aggregations

Common database barrier functions such as counting, summing, averaging, grouping, ordering, etc. are more often than not, best accomplished by the storage system (especially when the storage system is not under significant load from concurrent queries). The storage system can offer computational services for operations it can execute more efficiently than the processor via references. A collection of typical aggregations are provided below in mm-ADT bytecode and SQL. The following diagram and mm-ADT-bc source specify a reference graph for offloading barrier operations to the storage system.

mm-ADT bytecode SQL query

[db][get,persons][count]

SELECT COUNT(*) FROM persons

[db][get,persons][get,age][sum]

SELECT SUM(age) FROM persons

[db][get,persons][group,[get,age],[get,name],[fold]]

SELECT * FROM persons GROUP BY age, name

[db][get,persons][order,[lt,[get,age]]]

SELECT * FROM persons ORDER BY age DESC

[db][get,persons][get,age][dedup]

SELECT DISTINCT age FROM persons

functions
[db]
[define,people,person*
  -> [count]                          => long
  -> [get,age]                        => (int*
                                           -> [sum]   => int
                                           -> [dedup] => int*)
  -> [group,[[get,age]],[[get,name]]] => [(@int:[@string*])*]
  -> [order,[lt|gt,[get,age]]]        => @person*]
[put,persons,people => [[db][get,persons]]]

The above people definition does not have to be defined by the user. It is possible for the storage system to provide an anonymous type (at runtime) with various exposed data access paths not specified in the model-ADT embedding. Compile-time bytecode rewriting and runtime processor redirections leverage the same reference graph structure and the evolution of this structure does not have to be a priori fully defined. In this way, the storage system can play an active role in runtime query optimization.

Proxies

A reference object is a proxy to a referent instance. A reference can be manipulated as if it were the referent — providing data when read from and storing data when written to. What makes a reference different from its referent is that it can contain only that data necessary for it to access its referent via the type’s data access path. For large objects such as documents in a document database, references can be used to reduce object creation and limit storage/processor communication.

mm-ADT references are useful as proxies in the following scenarios.

  1. Remote Proxy: The processor is in a different address space (perhaps on a different physical machine) than the storage system node containing the referent instance. Network communication can degrades query performance.

  2. Lazy Proxy: The bytecode will not access all the data in the referent and thus, it is not necessary to load the entire object into memory.

Assume a document-ADT storing JSON documents with instances similar to the document below and matched by the following mm-ADT person definition.

{ "name" : "marko",
  "age"  : 29,
  "alive": true,
  "address": {
    "street" : "el camino",
    "city"   : "santa fe",
    "state"  : "nm"},
  "phones": {
    "home"   : "212 555-1234",
    "office" : "646 555-4567",
    "mobile" : "123 456-7890"},
  "spouse": null }
[db]
[define,person,[name:@string,
                age:@int,
                alive:@bool,
                address:[street:@string,
                         city:@string,
                         state:@string],
                phones:[(@string:@string)*],
                spouse:@person?]
  -> [has,name,eq,@string~] => person[name:$0]?]
[put,persons,person* => [[db][get,persons]]]

The person definition above states that the storage system provides direct access to person documents when search by name. The following bytecode retrieves a single person document and then walks the document tree to ultimately yield the expected result of "212 555-1234". While a storage system index is being used, object materialization and network communication can still be further reduced. The unfortunate aspect of the query is that the entire phones record is fetched when the person is dereferenced.

[db][get,persons]        // people reference
    [has,name,eq,marko]  // person reference
    [get,phones]         // phones instance
    [get,home]           // string instance

The mm-ADT [ref] machine instruction generates a new reference from another reference or instance. The provided anonymous super-type (less constrained) is used to populate the new reference with only that data that will be needed for the remainder of the query and thus, reducing the memory footprint and network bandwidth to the absolute required minimum.

proxies
[db][get,persons]                            // people reference
    [has,name,eq,marko]                      // person reference
    [ref,person[name:@string,
                phones:[(home:@string)?]]]   // person reference
    [get,phones]                             // phones reference
    [get,home]                               // string instance

The equivalent query using a MongoDB query document is provided below.

db.persons.find({name:marko,phones:{home:{$exists}}},{phones.home:1})
Co-Locations

Compile- vs. Run-Time

For a query language to be mm-ADT compliant, there must exist a compiler that transforms queries written in the query language’s syntax to mm-ADT bytecode. Given that mm-ADT bytecode is always written with respects to the model-ADT’s primary structure, the compiler is not required to understand, nor factor-in, the storage system’s secondary structures.

mm-ADT bytecode is optimized using two primary techniques.

  1. Algebraic Equivalences: mm-ADT bytecode is a structural representation of a stream-computing process graph. Stream ring theory’s algebra can be used for compile-time alterations of the process graph.

  2. Reference Redirections: mm-ADT bytecode is written for the model-ADT’s primary structure. References enable the runtime redirection of the processor in order to exploit more efficient data access paths through the structure graph.

The reference graph and [cost] instruction can be used for both compile- and run-time bytecode optimization.

Compile-Time Optimization
Runtime Optimization

model-ADT Structures and Processes

A database model is a type of data model that determines the logical structure of a database and fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relational model, which uses a table-based format.

— Wikipedia Article on Database Model

mm-ADT can be used to model popular database ADTs. However, because mm-ADT’s objects and instructions have pre-defined semantics, it is typical for most model-ADT’s to not be in one-to-one correspondence with (a subset of) mm-ADT. In such situations, the model-ADT needs to be embedded. Fortunately, these embeddings are not as complex as model-to-model embeddings (e.g. property graph to relational) as mm-ADT provides a flexible data model and universal instruction set. This section presents a collection of popular database model-ADT’s, demonstrates their mm-ADT embedding, and outlines secondary structures used to optimize common query patterns.

Key/Value Store model-ADT

kv intro A key/value store’s primary data structure is a hash table. This ADT supports three basic operations: get, put, and drop. A value is put into the store via its associated key. A value is retrieved from the store via its associated key. A value is removed from the store via its associated key. All keys must be unique. The key/value-ADT’s primary structure can be physically distributed with relative ease as each key/value pair is an atomic datum. Distributed key/value stores are based on distributed hash table principles. The classic key/value model-ADT will be presented first and then contemporary extensions will be explored.

Primary Structure

A classic key/value model-ADT
[model,kv=>mm,
  [define,key,@number|@string|@bytes],                                        (1)
  [define,value,@bool|@number|@string|@bytes|@sequence],                      (2)
  [define,keyvalue,[@key~,@value] => [[db][has,0,eq,$0]]                      (3)
    -> [put,0,@key] => [error]                                                (4)
    -> [drop,0|1]   => [error]],                                              (5)
  [define,db,keyvalue*                                                        (6)
    -> [put,keyvalue[@key~,@value~]] => [coalesce,[[has,0,eq,$0][put,1,$1]],
                                                  [put,[$0,$1]]]              (7)
    -> [dedup,[get,0]]               => []]]                                  (8)
1 The supported key types.
2 The supported value types.
3 A keyvalue is a list containing a key and a value.
4 A keyvalue’s key is immutable.
5 A keyvalue must always contain a key and a value.
6 A key/value store is a stream of keyvalue objects.
7 If a keyvalue with provided key exists, update its value, else insert provided keyvalue (update or create).
8 All keyvalue keys are unique.

Secondary Structures

The simplicity of the key/value data structure and the limited expressivity of its operations means that there are few secondary structures that can be leveraged in a meaningful way. The most important secondary structure is a hash-based index on key. Other indices that support other predicates beside equality may be useful. If the key/value pairs are sorted, then range queries based on key are more efficient to execute. Finally, most key/value stores support the parallel processing of all key/value pairs using the MapReduce pattern. The efficient implementation of this reference path requires a parallel, distributed processor whose workers are able to directly access data local partitions of the full key/value pair stream.

kv reference paths
reference path structure description

[has,0,eq,@key]

index

key/values indexed by key equality

[has,0,@pred,@key]

index

key/values indexed by key predicate

[put,@keyvalue]

function

storage system performs "update or create" locally

[count]

aggregation

key/value count

[order,[gt|lt],[get,0]]

sort order

key/values sorted by key

[has,0,between,[@key{2}]]

sort order/index

key/values sorted by key and accessible by range

[group,@bc{3}]

processor

key/value processing via MapReduce

There is no standard query language for key/value-stores. They are typically interacted with via a language-specific API. An example mm-ADT-bc source is provided below that demonstrates the use of the various defined instructions.

An mm-ADT-bc example using the key/value model-ADT
[db][count]                            // 0
[db][put,[a,[name:marko,age:29]]]
    [put,[b,[name:vadas,age:27]]]
    [put,[c,[name:josh,age:32]]]
    [put,[d,[name:peter,age:10]]]
[db][put,[d,[name:peter,age:35]]]      // updated d value
[db][count]                            // 4
[db][has,0,eq,a]
    [get,1]                            // [name:marko,age:29]
[db][group,[[get,1]
            [get,age]
            [ifelse,[is,gt,30],
                    [obj,old],
                    [obj,young]]],
            [obj,1],
            [sum]]                     // [old:2,young:2]

Property Graph model-ADT

A property graph is network data structure composed of a set of vertices connected by a set of edges. In a directed binary graph, edges connect two vertices: a tail vertex and a head vertex. If the graph is multi-relational, then edges are typed. If the graph is attributed, then vertices and edges have key/value properties. Different graph database providers have slightly different property graph models. The model popularized by Apache TinkerPop3 will be presented first and then common variations will be discussed.

pg intro

Primary Structure

A classic property graph model-ADT
[model,pg=>mm,
  [define,key,string<neq(id)&neq(label)>],                                             (1)
  [define,value,@bool|@number|@string],                                                (2)
  [define,property,@key:@value]                                                        (3)
  [define,element,[id:@object,label:@string,@property*]                                (4)
    -> [drop,(id|label)]              => [error]                                       (5)
    -> [put,(id|label),@object]       => [error]                                       (6)
    -> [is,eq,@element[id:@object~]]  => [has,id,eq,$0]],                              (7)
  [define,vertex,element[id:@object~,inE:@edge*,outE:@edge*]                           (8)
                  => [[db][has,id,eq,$0]]                                              (9)
    -> [get,outE] => (edge[outV:$_]*
                       -> [put,@edge~] => [branch,[put,$0],
                                                  [[get,inV][get,inE]
                                                            [put,![ref,$0]]]]),        (10)
    -> [drop]     => [[branch,[_],
                              [get,inE],
                              [get,outE]]
                      [drop]]],                                                        (11)
  [define,edge,element[id:@object~,outV:@vertex,inV:@vertex]                           (12)
                                => [[get,outV][get,outE][has,id,eq,$0]]                (13)
    -> [put,(outV|inV),@vertex] => [error]                                             (14)
    -> [drop,(outV|inV)]        => [error]                                             (15)
    -> [drop]                   => [[branch,[_],
                                            [[get,inV][get,inE][has,id,eq,$_0]]]
                                    [drop]]],                                          (16)
  [define,db,vertex*                                                                   (17)
    -> [dedup,[get,id]] => []                                                          (18)
    -> [get,outE] => (edge* -> [dedup,[get,id]] => [])]]                               (19)
1 Property keys are strings, but can not be id or label.
2 Property values can only be simple primitives.
3 Properties are key/value pairs.
4 Elements has an id, label, and any number of other properties.
5 Element ids and labels are immutable.
6 Element ids and labels are immutable.
7 An element is equal to another element if their ids are equal.
8 A vertex is a subtype of element with incoming and outgoing edge streams.
9 A vertex’s data access path is in the [db] stream.
10 Adding an edge to a vertex’s outgoing stream adds it to the incoming stream of the adjacent vertex.
11 Deleting a vertex deletes the vertex and all its incoming and outgoing edges.
12 An edge is a subtype of element with outgoing (tail) and incoming (head) vertices.
13 An edge’s data access path is in the out vertex’s outgoing edge stream.
14 An edge’s vertices are immutable.
15 An edge’s vertices are immutable.
16 Delete the canonical edge and the reference edge in the adjacent vertices incoming edges.
17 The graph [db] and is a stream of vertices.
18 Vertices have unique ids.
19 Edges have unique ids.

Secondary Structures

Common secondary structures are presented in the following tables. The reference paths are ordered by their frequency of use. For instance, accessing a vertex by its id or vertices by their label or a property are extremely common in practice. It is important that an implementation provide secondary structure support for these high-ranking patterns as resolving them via the primary structure will greatly reduce the usefulness of the embedding.

Graph Reference Paths

pg graph reference paths

reference path structure description

[has,id,eq,@long]

index

vertices indexed by id

[has,label,eq,@string]

index

vertices indexed by label

[has,@key,@pred,@value]

index

vertices indexed by property

[has,label,eq,@string]
[has,@key,@pred,@value]*

index

vertices indexed by a composite of label and properties

[has,label,eq,@string]
[order,[gt|lt,[get,@key]]]

index/sort order

vertices indexed by label and sorted by property

[count]

aggregate

vertex count

[has,label,eq,@string]
[count]

aggregate

vertex count by label

[get,(inE|outE)]
[count]

aggregate

edge count

[get,(inE|outE)]
[has,label,eq,@string]
[count]

aggregate

edge count by label

Vertex Reference Paths

pg vertex reference paths

reference path structure description

[get,(inE|outE)]
[has,id,eq,@object]

index

incident edge by id

[get,(inE|outE)]
[has,label,eq,@string]

index

incident edges indexed by label

[get,(inE|outE)]
[has,@key,@pred,@value]

index

incident edges indexed by property

[get,(inE|outE)]
[has,label,eq,@string]
[has,@key,@pred,@value]*

index

incident edges indexed by a composite of label and properties

[get,inE]
[filter,[[get,outV][is,eq,@vertex]]]

index

incident incoming edges indexed by adjacent vertex

[get,outE]
[filter,[[get,inV][is,eq,@vertex]]]

index

incident outgoing edges indexed by adjacent vertex

[get,(inE|outE)]
[count]

aggregate

incident edge count

[get,(inE|outE)]
[has,label,eq,@string]
[count]

aggregate

incident edge count by label

[get,(inE|outE)]
[has,label,eq,@string]
[order,[gt|lt,[get,key]]]

index/sort order

incident edges indexed by label ordered by property

[get,(inE|outE)]
[has,label,eq,@string]
[drop]

index/aggregate

incident edges by label dropped

[get,(inE|outE)]
[drop]

aggregate

incident edges dropped (fast vertex delete)

[get,inE]

denorm

incoming edge references as instances

Edge Reference Paths

pg edge reference paths

reference path structure description

[get,(outV|inV)]
[get,id]

denorm

incident vertex id colocated with edge

[get,(outV|inV)]
[get,label]

denorm

incident vertex label colocated with edge

[get,(outV|inV)]
[get,@key]

denorm

incident vertex property colocated with edge

Example Compilations

Each table below presents an example graph traversal using Gremlin and two respective compilations to mm-ADT bytecode. If a reference path is available (given the previously presented secondary structures), then the path is written on a single line. Thus, the fewer lines of bytecode, the more secondary structures are being used to solve the query.

language How many knows edges exist in the graph?

Gremlin

g.V().outE('knows').count()

mm-ADT-bc
(primary)

[db]
[get,outE]
[has,label,eq,knows]
[count]

mm-ADT-bc
(primary
+ secondary)

[db][get,outE][has,label,eq,knows][count]

language What are the names of the people that marko knows well?

Gremlin

g.V().has('name','marko').outE('knows').has('weight',gt(0.85)).inV().values('name')

mm-ADT-bc
(primary)

[db]
[has,name,eq,marko]
[get,outE]
[has,label,eq,knows]
[has,weight,gt,0.85]
[get,inV]
[get,name]

mm-ADT-bc
(primary
+ secondary)

[db][has,name,eq,marko]
[get,outE][has,label,eq,knows][has,weight,gt,0.85]
[get,inV][get,name]

Resource Description Framework model-ADT

rdf intro The Resource Description Framework (RDF) is a graph-based data model developed by the W3C. RDF’s primary structure is a sequence of triples (or quads) called statements. These statements are queried using an expressive, easy-to-use graph pattern-matching language called SPARQL. The components of a statement are web-addressable URIs as RDF was initially intended to be integrated into a world-wide knowledge-graph known as the Semantic Web. However, RDF stores are generally useful as a closed-world database.

Primary Structure

An triple-based RDF model-ADT
[model,rdf,
  [define,uri,string<regex('//^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)'>], (1)
  [define,literal,@string|@number|@bool]                                                     (2)
  [define,bnode,string<regex('_:.*')>]                                                       (3)
  [define,statement,[s:(@uri|@bnode)~,
                     p:@uri~,
                     o:(@uri|@bnode|@literal)~] => [[db][is,eq,$_]]                          (4)
    -> [is,eq,@statement]  => [[has,s,eq,$_0][has,p,eq,$_1][has,o,eq,$2]]                    (5)
    -> [put,s|p|o,@object] => [error]                                                        (6)
    -> [drop,s|p|o]        => [error]]                                                       (7)
  [define,db,statement*                                                                      (8)
    -> [dedup] => []                                                                         (9)
    -> [put,@statement~] => [[coalesce,[[db][is,eq,$0]],[[put,$0]]][db]]]]                   (10)
1 A uri is a string that matches the provided regular expression pattern.
2 A literal is a primitive object
3 A blank node is a locally resolvable URI
4 A statement is record containing a subject, predicate, and object.
5 Two statements are equal if they have the same subject, predicate, and object.
6 A statement is immutable.
7 A statement is immutable.
8 An RDF-store is a stream of statements.
9 An RDF-store does not contain duplicate statements.
10 If the statement already exists, do nothing, else insert statement.

Secondary Structures

rdf reference paths
reference path structure description

[has,s,eq,@uri|@bnode]

index

statements indexed by subject

[has,p,eq,@uri]

index

statements indexed by predicate

[has,o,eq,@uri|@bnode|@literal]

index

statements indexed by object

[has,s,eq,@uri|@bnode]
[has,p,eq,@uri]

index

statements indexed by a composite of subject/predicate

[has,o,eq,@uri|@bnode|@literal]
[has,p,eq,@uri]

index

statements indexed by a composite of object/predicate

[has,s,eq,@uri|@bnode]
[has,p,eq,@uri]
[at,[db],
 [[has,s,eq,![get,o]]
  [has,p,eq,@uri]]]

denorm

length-2 statement paths pre-joined by object/subject

[count]

aggregate

statement count

[has,p,eq,@uri]
[count]

index/aggregate

statement count by predicate

Example Compilations

Each table below presents an example SPARQL pattern-match and two respective compilations to mm-ADT bytecode. If a reference path is available (given the previously presented secondary structures), then the path is written on a single line. Thus, the fewer lines of bytecode, the more secondary structures are being used to solve the query.

language How many knows edges exist in the graph?

SPARQL

SELECT COUNT(?s) WHERE {?s ex:knows ?o}

mm-ADT-bc
(primary)

[db]
[has,p,eq,ex:knows]
[count]

mm-ADT-bc
(primary
+ secondary)

[db][has,p,eq,ex:knows][count]

language What is the average age of the people that marko knows?

SPARQL

SELECT AVG(?b) WHERE {
  ex:marko ex:knows ?a
  ?a ex:age ?b .}

mm-ADT-bc
(primary)

[db]
[has,s,eq,ex:marko]
[has,p,eq,ex:knows]
[at,[db],
 [[has,s,eq,![get,o]]
 [has,p,eq,ex:age]]]
[get,o]
[mean]

mm-ADT-bc
(primary
+ secondary)

[db][has,s,eq,ex:marko][has,p,eq,ex:knows][at,[db],[[has,s,eq,![get,o]][has,p,eq,ex:age]]]
[get,o]
[mean]

language What project did marko’s colleague create with him who also created ripple?

SPARQL

SELECT ?a ?b WHERE {
  ex:marko ex:knows ?a
  ex:marko ex:created ?b
  ?a ex:created ?b
  ?a ex:created ex:ripple . }

mm-ADT-bc
(primary)

[db]
[match,
 [[at,[db],
  [has,s,eq,ex:marko]]
  [has,p,eq,ex:knows]
  [get,o]
  [as,a]],
 [[at,[db],
  [has,s,eq,ex:marko]]
  [has,p,eq,ex:created]
  [get,o]
  [as,b]],
 [[path,a]
  [at,[db],
   [has,s,eq,![_]]]
  [has,p,eq,ex:created]
  [get,o]
  [as,b]],
 [[path,a]
  [at,[db],
   [has,s,eq,![_]]]
  [has,p,eq,ex:created]
  [has,o,eq,ex:ripple]]]

mm-ADT-bc
(primary
+ secondary)

[db]
[match,
 [[at,[db],[has,s,eq,ex:marko]][has,p,eq,ex:knows]
  [get,o]
  [as,a]],
 [[at,[db],[has,s,eq,ex:marko]][has,p,eq,ex:created]
  [get,o]
  [as,b]],
 [[path,a]
  [at,[db],[has,s,eq,![_]]][has,p,eq,ex:created]
  [get,o]
  [as,b]],
 [[path,a]
  [at,[db],[has,s,eq,![_]]][has,p,eq,ex:created][has,o,eq,ex:ripple]]]

Document model-ADT

A document database is composed of a set of named collection containing documents. Popular document structures include XML, JSON, and YAML. Document are hierarchical (tree-like) and isolated from each other. That is, documents do not refer to each other. Document isolation facilitates distribution as each node in the cluster is responsible for a range of documents (hashed by a shared field such _id). A collection is searched using document pattern-matching. A proto-document containing predicates is submitted as a query and all documents that match the query document are accessible via a result iterator. This section will present a JSON-based document model-ADT.

Primary Structure

A Document model-ADT
[model,doc,
  [define,document,dobject[_id:@string]],
  [define,dobject,[(@string:@value)*]],
  [define,dlist,[@value*]],
  [define,dvalue,@string|@number|@bool|@dobject|@dlist],
  [define,db,[(@string:@document*)*]
    -> [dedup,[get,_id]] => []]]

Secondary Structures

reference path structure description

[has,_id,eq,@string]

index

documents indexed by _id value

[get,@string]
[has,@string,eq,@dvalue]+

index

collection documents indexed by composite of first-level fields

[get,@string]
([get,@string]
 [is,@pred,@dvalue]*)+

index

collection documents indexed by nested pattern

[get,@string]
[is,typeof,@type]

index

collection documents indexed by pattern-matching

[get,@string]
[count]

aggregate

collection document count

[drop,@string]

function

delete a collection of documents

[get,@string]
[dedup,[get,@string]+]

function

collection documents dedup’d by field

[group,@bc{3}]

processing

document processing via MapReduce

db.persons.insertOne({name:marko,
                      age:29,
                      projects:[{name:lop,lang:java}],
                      knows:[{name:josh},{name:vadas}]})
db.persons.find({"knows.name":{$regex:/^j.*/}},{name:1,age:1})
[db][get,persons]
    [put,[name:marko,
          age:29,
          projects:[[name:lop,lang:java]],
          knows:[[name:josh],[name:vadas]]]]
[db][get,persons]
    [filter,[[get,knows]
             [get,name]
             [is,regex,'^j.*']]]
    [ref,[name:@string,age:@int]]

Relational model-ADT

Wide-Column model-ADT

Heap model-ADT

Serialization

Textual Representation

Writing Bytecode

Reading Bytecode

Binary Representation

Writing Bytecode

Reading Bytecode

Glossary

Abstract Data Type (ADT)

A data structure specification containing data types, instructions, and their operational semantics. An instance of an ADT is a data structure operated on with instructions from the ADT instruction set.

ADT embedding

ADT-1 can be losslessly encoded in ADT-2 if ADT-2’s data types and instructions form a mathematically equivalent superset of ADT-1’s specification. An embedding requires the existence of a homomorphism from the data types and instructions of ADT-1 to the data types and instructions of ADT-2.

Anonymous type

A type that has no defined name symbol.

Bytecode

A linear, nested collection of instructions from the mm-ADT instruction set. Bytecode represents the functional composition of instructions into a procedural program used to read/write mm-ADT data structures.

Denormalization

The duplication of data within the primary structure as a means of memoizing frequented data access paths.

Dereference

The conversion of a reference to its referent.

Embedding

An ADT modeled using the data types and operations of another ADT.

Function reference

A reference containing instruction objects that more optimally executed by the storage system. Examples include count, max, group, dedup, etc. instructions.

Index

An secondary structure that provide sublinear time access to objects in the primary structure. In mm-ADT, an index is any reference with a filter-instruction to a resultant reference containing a subset of the original reference’s referents.

Instruction

An atomic unit of processing. An instruction maps a sequence of zero or more input objects to a sequence of zero or more output objects. Instructions are composed to express complex computations.

mm-ADT

A multi-model ADT composed of primitive and sequence data types and map, filter, reduce, and branch instructions.

model-ADT

An ADT with a known mm-ADT encoding.

Named type

A type pattern that has a defined name symbol.

Noop reference

A reference with instructions that map to [].

NoSQL

A post-millennium technology movement focused on the development of scalable, distributed database technologies via non-relational ADTs that can more readily be distributed at the expensive of query expressivity.

Instance

An object that can match one and only one structure: it’s self.

List

An mm-ADT sequence data type composed of ordered objects accessible by an integer index.

Pattern

A pattern is a data structure that describes a data structure. A pattern may be general enough that is describes many structures (type) or it may be specific enough that it describes only one structure (instance).

Pointer reference

A reference used to combine data into complex data structures.

Processing engine

A software program that can generally execute functions on data. An mm-ADT compliant processing engine can generate an execution plan from submitted bytecode. A processing engine manipulates an mm-ADT storage system’s data structure as a means of computing on the embedded model-ADT.

Proxy reference

A sparsely populated reference to a remote referent.

Query language

A human-oriented computer language for controlling a processing engine. While the term "query" implies read-only, in practice, it also includes mutations. An mm-ADT compliant query language maintains a compiler from the query language to mm-ADT bytecode.

Bytecode strategy

A transpiler for mapping a bytecode’s instruction-tree to another instruction-tree. Optimization-oriented bytecode strategies leverage data statistics and expression equivalences to generate an instruction-tree that is more efficient to execute yet semantically equivalent to the original bytecode instruction-tree.

Record

An mm-ADT sequence data type composed of key/value pairs called fields.

Reference

A proxy to an referent object. References maintain the type of and data access path to the referent.

Referent

An object that is referred to by a reference.

Schema

A specification of the structure of data. mm-ADT types can be used to model database schema constraints.

Storage system

A software program that physically encodes a data structure onto disk, into memory, or distributed across a multi-machine compute cluster. Beyond data storage, these systems typically manage transactions, concurrency, distribution, caching, backup, and other persistence related features.

Stream

A unbounded sequence of objects accessible only through iterative, lazy evaluation.

Substream reference

A reference to a subset of the referents of another referent. Database indices are modeled as substream references.

Subtype

Type X is a subtype of type Y if X matches a subset of the objects matched by Y.

Type

An object that matches more than one structure.

Conclusion

Specification Roadmap

Built-In Type System
  • Should there only be integer and floating as short,long, double, etc. can be modeled as constraints.

  • What is the general nature of the primitive operators?

    • Is the [op] instruction a good idea (type safety issues)?

    • Should math be contained in a single instruction [math,expression]?

Instruction Set
  • The instruction set is about 75% complete.

  • Many instructions from TP3 are no longer needed.

    • given generalized bytecode arguments.

    • given all structures are lists or records.

  • [unfold] is a "sneaky" data access method (consider it in models)

Exception Handling
  • There is simply an [error] instruction (with no specification).

  • Are there different levels of failure?

  • Can errors be caught? (exception handling)

Type Derivations
  • Is the & and | model of combining objects to create types and instances good?

  • The model just sort of popped out without much thought.

Transaction Handling
  • Thus far it is assumed bytecode is contained in a transaction.

  • Does mm-ADT need to consider transactions?

Object Threads
  • TP3 uses "traversers" and stream ring theory uses "object threads."

  • How should we specify paths and sacks?

    • Stream ring theory’s object metadata model is more general than TP3.

    • Stream ring theory’s object metadata model is a register-based system.

      • "Traversers" are realized as mini-CPUs.

mm-ADT-bc Compiler
  • Developer an ANTLR-based compiler.

  • Ensure no ambiguities in the mm-ADT-bc language.

  • Work towards a syntax that provides an easy to develop parser.

mm-ADT Prototype
  • Develop a Java-based mm-ADT object API.

  • Study the practical nature of references.

Cost instruction
  • Develop [cost] as a way to determine the expense of various data paths.

Specify more model-ADTs
  • Finish up document model-ADT.

  • Develop the heap model-ADT as an example of a general-purpose computer.

  • Develop property graph model-ADT variants (n-ary, multi-props, meta-props, etc.)

  • Develop a hypergraph model-ADT.

  • Develop a matrix model-ADT for expressing matrix computations.

    • This should really test the expressivity of the instruction set.

Discuss the Practical Stream Computing
  • Akka-style message passing (multi-machine)

  • Spark-style serial scans (multi-machine)

  • Pipes-style chained iterators (single-machine)

  • Rx-style push-based semantics (single-machine, multi-threaded)

  • Most of this is already implemented in tp4/ branch.

Bytecode Optimization through Embedding
  • Is it possible to represent bytecode strategies as automorphic [model] embeddings (bc⇒bc).

Supertype for Record and List
  • struct and stream, where list and record are subtypes of struct. ?

Co-Location of Instance and Reference
  • Develop a way for mm-ADT objects to migrate throughout a cluster.

  • Think in terms of a single mmADTInputFormat and a message passing instruction subset for query routing.

The Database Object
  • The [db] instruction and object need to be thought through more clearly.

  • Does every node in the cluster have a db object?

  • Is the db object provide access to both storage and processing?

  • How are storage systems and processing engines connected to/configured?

Include the Coefficient Ring
  • Stream Ring Theory is founded on the product of two rings: the function ring and the coefficient ring.

  • Develop a syntax for coefficients in bytecode and coefficients in objects.

Develop the Stream Ring Theory Section
  • Provide mm-ADT-bc rewrite rules that implement the axioms and theorems of stream ring theory.

  • Demonstrate compilation via algebraic manipulation.

The Role of the Match Instruction
  • Should this be implemented using more primitive instructions?

  • The runtime cost should be handled by a reference.

  • Can an ingenious reference structure yield match semantics without the actual instruction?

Specify Casting
  • How do you change the type of an object?

    • We currently just [ref] it, but this is only useful for streams.

  • How do you redefine a type? [define]?

Namespaces
  • As we include more objects and models, it will be necessary to develop a namespacing system.

Join Algorithms
  • The [at] instruction is providing nested loop join semantics.

  • Need better examples demonstrating sort-merge join and the role of references in the decision process.

  • A rough example of a hash join implementation was demonstrated using instruction rewrites.

    • This was interesting in that the reference did a complex computation and stored results.

    • References as complex computational entities?

  • If we can add some form of memoization (aggregate/barrier), we can do describe/execute more efficient joins.