Release Notes

mm-ADT is an early stage project. This 0.1-alpha version of the specification is being publicly released in the hopes of receiving feedback on the technical aspects of mm-ADT as well as the social aspects as we search for an appropriate venue to continue its development. There are various design hurdles that are difficult to overcome using pure reason alone. While the specification is being reviewed, the team will transition to prototyping an mm-ADT-bc compiler. The specification roadmap provides an overview of the known issues that need to be addressed as the project moves forward.

mm-ADT is currently being funded by DataStax and RReduX.

## Introduction

An abstract data type (ADT) is a mathematical model for data types, where a data type is defined by its behavior (semantics) from the point of view of a user of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations. This contrasts with data structures, which are concrete representations of data, and are the point of view of an implementer, not a user.

— Wikipedia Article on Abstract Data Type

This specification defines a multi-model abstract data type (mm-ADT) capable of representing the logical structure and idiomatic processes of common database data types (model-ADT) such as relational tables, n-ary property graphs, RDF statements, document trees, wide-column tuples, key/value pairs, and any hybrid or novel ADT to come.

mm-ADT’s data model and operations encapsulate the many structural and procedural patterns common to all databases. Auxiliary structures such as indices, denormalized data locations, sort orders, and schemas are implicitly exposed to the optimizer (at compile-time) and the processor (at runtime) via data access paths that exist outside the model-ADT’s primary, logical structure. By abstracting away the physical encoding and supporting optimizations, query language compilers and processing engines are able to benefit from the custom features of the underlying storage system all the while remaining agnostic to their existence. mm-ADT provides a common integration point for storage systems, processing engines, and query languages to effectively commingle, regardless of their respective model-ADT orientations.

### NoSQL Databases

At the turn of the millennium, the database space was primarily populated by SQL-driven relational databases. The millennium’s second decade saw an explosion of open source database technology focused on providing distributed, fault tolerant storage at the expense of simpler query and transactional semantics. "NoSQL" databases abandoned the expressive, general-purpose relational-ADT in favor of new ADTs tailored to satisfy particular application requirements.

• Graph Database: Relational database rows are coupled at query-time via an explicit `JOIN`-operation. Graphs are "pre-joined" structures. In the graph-ADT, vertices are coupled via edges and graph-oriented query languages, capitalizing on the a priori existence of these links, provide users a concise traversal and pattern-matching syntax not available in relational-ADT oriented languages such as SQL.

• Key/Value Store: Relational databases are bound to the relational-ADT and the real-world limitations created by its expressiveness. If an application only requires simple `PUT` and `GET` access to a massive amount of data indexed by unique keys, then while a relational query language can support the query requirements, a relational storage system may not support the scaling requirements. Key/value databases are easier to scale because the key/value-ADT, with its logically partitioned data model and basic operations, naturally support the data’s physical distribution across a multi-machine compute cluster.

• Document Database: The key/value-ADT does not maintain operations for the internal analysis of the value data. If the value’s data can be represented as a tree-like "document" then a document database can be used. The document-ADT extends the key/value-ADT with structured values (i.e. documents), value-aware indices, and respective query operations capable of traversing the internal structure of the values.

• RDF Store: Long before the NoSQL database movement, the W3C developed a web-oriented data model called the Resource Description Framework (RDF). The RDF-ADT is structured as a single unordered collection of statements (triples or quads). Each statement has a subject, predicate, and object component, where complex statement patterns are queried using a pattern-matching language called SPARQL. Similar to the aforementioned graph query languages, SPARQL does not have a `JOIN` operator as RDF statements are implicitly linked when any of their statement components are equal.

• Wide-Column Store: The wide column-ADT is generally understood to be a relational-ADT with a relaxed data model and no `JOIN`-operation. Rows are organized into tables, however, rows of the same table aren’t required to have the same structure (i.e. tables are not required to be `nxm`-matrices). In fact, rows within the same table can have different lengths, where in practice, multi-million column rows are used for data denormalization and the creation of endogenous, ADT-encoded indices.

### Multi-Model Databases

To date, databases remain primarily classified by their ADT. While database ADTs are strongly influenced by the storage system’s physical structure (both on-disk and in-memory) and the supported operational primitives of the processing engine, it is ultimately the lexicon and prescribed data-perspective of the query language that solidifies the database’s category. At the level of the processing engine and storage system, many superficial qualities distinguish tables/rows, vertices/edges, statements, and documents. In fact, many database processing engines and storage systems have features that are generally useful across ADTs such as processors supporting stream-oriented parallel and distributed computing with an extensible set of functional primitives and storage systems supporting distribution, fault-tolerance, replication, caching, indexing, and transactions.

The multi-model database trend is founded on the realization that with a sufficiently complex database ADT, multiple model-ADTs can be embedded and when doing so, concepts such as a row, a document, and a vertex lose their meaning as rows and documents can reference each other and vertices and edges can be joined, selected, and projected. However, for multi-model database engineers, while it is important to understand the data model and operations of their target-ADTs, it is equally important to understand the optimization techniques and query idioms used in practice by those databases designed specifically for those model-ADTs. Ultimately, a multi-model database will come to support various indexing, caching, and denormalization techniques, different serial and random access threaded and distributed processors, and either multiple ADT-specific query languages or a single query language that can naturally express all their supported model-ADTs’ idioms. At the limit, when storage system representations, processing engine features, and query language syntaxes are sufficiently malleable to effectively and efficiently structure and process any known ADT instance, the database industry may transition out of its variegated "Tower of Babel"-existence into the world of universal, general-purpose computing much like the hardware and programming language industries did so many decades before.

### Intended Audience

This specification presents a database virtual machine architecture. An mm-ADT virtual machine is programmed by a query language and evaluated by a processing engine against data persisted in a storage system. However, this specification can also be used for the intermediate case of query language translation.

Query Language Translation

The first use case translates a query that is oriented for processing an X model-ADT into a query that is oriented for processing a Y model-ADT via a process called embedding. For example, a graph database query can be translated to a relational database query, where the graph structure is realized as a foreign key-based, first normal form relational structure with `JOIN` operations replacing traversal operations. model-ADT embedding is not always possible at the query translation-level as the Y model-ADT may lack the expressivity required to capture the semantics of the X model-ADT query. For universal multi-model support, an mm-ADT database virtual machine can be implemented.

Database Virtual Machine

The second use case is for database architectures that have loosely coupled language compilers, processing engines, and storage systems. A query is compiled to language-agnostic mm-ADT bytecode. That bytecode represents a computation aimed at the primary, logical structure of the language’s intended model-ADT. If the language’s model-ADT is different from the storage system’s model-ADT, the bytecode undergoes a translation process defined by an embedding. A processor then executes the resultant bytecode using mm-ADT objects which provide bi-directional communication with the storage system. Database architectures leveraging an mm-ADT virtual machine allow any query language to program any processor to compute over any storage system, where the distinction between implementations lies in the time/space trade-offs made in the respective components' internal designs.

Given these two use cases, the ideas and architecture captured herein may prove useful to the following groups.

• Database providers: The creators of database products develop and maintain a diverse array of technologies from query to storage. mm-ADT fosters a looser coupling between these technologies. The second part of this specification discusses the secondary structures that are typically used to optimize queries in the respective model-ADT community. mm-ADT provides an ADT agnostic means of leveraging these optimizations. As such, mm-ADT compliant database providers can widen the scope of their product’s offering with new model-ADT’s and respective query languages.

• Open source database technologists: The open source community has an outstanding number of quality projects that are being integrated into database products at an increasing rate. Examples include batch and stream processors, query language frameworks, distributed storage systems, general-purpose indexing engines, and the like. mm-ADT makes it possible to integrate these components in a standardized manner, thus enabling the proliferation of novel database systems and ADTs.

• Database enthusiasts: Database technology is influenced by numerous core computer science sub-domains such as networking, hardware, data structures, distribution, compilers, etc. mm-ADT provides a single conceptual and computational framework that integrates many of these database considerations and, unlike most database textbooks, this specification is not focused on the relational model and the particulars of its encoding, processing, and optimization techniques.

Most database management systems are organized around a single data model that determines how data can be organized, stored, and manipulated. In contrast, a multi-model database is designed to support multiple data models against a single, integrated backend.

— Wikipedia Article on Multi-Model Database

The presented multi-model abstract data type (mm-ADT) facilitates the embedding of common database ADTs (model-ADTs) into an mm-ADT compliant data storage system to be manipulated by an mm-ADT compliant data query language whose submitted queries are ultimately evaluated by an mm-ADT compliant data processing engine.

• Storage System (storage-ADT): provides persistence, distribution, transaction, and indexing features.

• Query Language (language-ADT): provides query syntax features and compilation.

Every database component is designed according to a particular model-ADT. For instance, a storage system may have a logical and physical representation optimized for processing relational data. Similarly, a query language may be designed to describe processes over a graph. A graph-based query language is unable to direct processes over a relational storage system because there exists an impedence mismatch between the two models. The role of mm-ADT is to provide both compile- and run-time mappings that bridge disparate models in order for query languages to work with storage systems regardless of their respective model-ADTs and moreover, with existing data.

mm-ADT is not only able to losslessly capture a model-ADT’s logical, primary structure, it also captures secondary data access paths necessary for efficient processing. The following diagram details the relationship between the various components of an mm-ADT database architecture. The concepts presented will be discussed at length over the course of this specification.

Before proceeding, an example demonstrating the two previously discussed intended mm-ADT use cases is provided. The source code below is written in a textual bytecode language called mm-ADT-bc. This code defines a particular (naïve) embedding of a property graph-ADT into a relational-ADT. The defined types have various instruction re-write rules (`-> x => y`) that specify how to read/write `vertices` and `edges` to and from a relational structure composed of `tables` and `rows`. Rewrites also contain instructions for accessing secondary structures such as table indices.

 mm-ADT-bc is a low-level bytecode language similar in nature to a machine assembly language. It is the role of high-level query language designers to develop mm-ADT bytecode compilers for their languages. mm-ADT bytecode can be consumed by any mm-ADT compliant processor and storage system.
``````[model,rdb=>mm,                                                                     (1)
[define,row,[(@string:@bool|@number|@string)*]]
[define,table,row*]
[define,db,[(@string:@table)*]]]

[model,pg=>rdb,                                                                     (2)
[define,vertex[id:@long~,
label:@string~,
name:@string~,
age:@int~,
inE:@edge*,
outE:@edge*] =>
(row[id:$0,label:$1,name:$2,age:$3] => [[db][get,V][has,id,eq,$0]]) (3) -> [get,outE] => ((row* => [[db][get,E]]) => [has,outV,eq,$_0])               (4)
-> [get,inE]  => ((row* => [[db][get,E]]) => [has,inV,eq,$_0]) -> [drop] => [[db][get,E] (5) [or,[has,outV,eq,$_0],
[has, inV,eq,$_0]] [drop] [db][get,V] [has,id,eq,$_0]
[drop]]],
[define,edge[id:@long~,
label:@string~,
weight:@float~,
outV:@vertex[id:@long~],
inV:@vertex[id:@long~]] =>                                           (6)
(row[id:$0,label:$1,weight:$2,outV:$3,inV:$4] => [[db][get,E][has,id,eq,$0]])   (7)
-> [put,(outV|inV),@vertex] => [error]],
[define,db,vertex* => [[db][get,V]],
-> [new] => [[db]                                                               (8)
[put,V,table<row[id:@long,
label:@string,
name:@string,
age:@int]*>   => [[db][get,V]]
-> [has,id,eq,@long~]        => row[id:$0]?] (9) [put,E,table<row[id:@long, label:@string, weight:@float, outV:@long, inV:@long]*> => [[db][get,E]] -> [has,id,eq,@long~] => row[id:$0]?                      (10)
-> [has,outV,eq,@long~]      => (row[outV:$0]* -> [has,label,eq,@string~] => row[outV:$_0,label:$0]*) (11) -> [has,inV,eq,@long~] => (row[inV:$0]*
-> [has,label,eq,@string~] =>  row[inV:$_0,label:$0]*)]]]]     (12)``````
 1 A relational-ADT embedded in mm-ADT (the storage/target-ADT). 2 A property graph-ADT embedded in the relational-ADT (the language/source-ADT). 3 A `vertex` is encoded as a `row` in a `table` called V. 4 Vertex edges are accessible via a table called E. 5 When a vertex is deleted, all its edges are deleted from E. 6 An edge’s incident vertices are accessed via a foreign key to the V table. 7 An `edge` is encoded as a `row` in a `table` called E in first normal form. 8 The property graph-ADT `db` is initialized with two relational-ADT `tables`. 9 The V table has a unique index on `id` (specified unique by `?`). 10 The E table has a unique index on `id` (specified unique by `?`). 11 The E table has a composite index on `outV` and `label`. 12 The E table has a composite index on `inV` and `label`.
 The property graph model-ADT presented above is degenerate. The embedding is not generalized to support schema-less graphs as relational-ADT `tables` require a predefined schema. To keep the example simple, only two `tables` (V and E) were created and thus, all `vertices` and `edges` have a fixed property set. A more complete property graph-ADT model is provided in the property graph model-ADT specification.

The Gremlin graph traversal below creates two vertices and connects them with a knows-edge.

Gremlin traversal creating and connecting two vertices by a knows-edge
``````g.addV('person').property(id,1).
.property('name','marko')
.property('age',29)
.propery('weight',0.5)
.property('age',27))``````

An mm-ADT compiler for Gremlin would generate the following mm-ADT bytecode. Note that this bytecode is oriented towards the primary, logical structure of a property graph.

Gremlin traversal compiled to property graph-ADT bytecode
``````[db][put,vertex[id:1,label:person,name:marko,age:29,inE:<>,outE:<>]]     (1)
[obj,vertex[id:1]]                                                   (3)
[get,outE]                                                           (4)
[put,edge[id:7,label:knows,weight:0.5f,
outV:vertex[id:1],inV:vertex[id:2]]]                       (5)``````
 1 A vertex is added to the storage system. 2 Another vertex is added to the storage system. 3 A reference to the first vertex is generated. 4 A reference to the first vertex’s outgoing edges is generated. 5 A knows-edge from the first to the second vertex is added to the first vertex’s outgoing edges.

In order to execute the above bytecode over a relational storage system, the previously defined embedding provides a mechanical procedure for translating property graph bytecode to relational bytecode.

``````[db][get,V]                                                 (1)
[put,row[id:1,label:person,name:marko,age:29]]          (2)
//  [obj,row[id:1] => [[db][get,V][has,id,eq,1]]]           (4)
[db][get,E]                                                 (5)
[put,row[id:7,label:knows,weight:0.5f,outV:1,inV:2]]    (6)``````
 1 The V table is referenced. 2 A "vertex"-row is added to the V table. 3 Another "vertex"-row is added to the V table. 4 A reference to the first row is generated, but is superfluous as it is never dereferenced. 5 The E table is referenced. 6 An "edge"-row is added to the E table.

If the relational storage system has an SQL processor, then instead of using an mm-ADT processor, the above bytecode can be translated to the corresponding SQL query below. Note the structural similarities between the relational bytecode and the SQL query.

``````INSERT INTO V
(id,label,name,age)
VALUES
(1,'person','marko',29),
INSERT INTO E
(id,label,weight,outV,inV)
VALUES
(7,'knows',0.5,1,2);``````

The Gremlin graph traversal below determines the average age of the people that vertex 1 knows.

Gremlin traversal for determining the average age of the vertices known by vertex 1
``g.V(1).out('knows').values('age').mean()``

A compiler from Gremlin to mm-ADT would generate the following property graph bytecode.

Gremlin traversal compiled to property graph-ADT bytecode
``````[db][has,id,eq,1]
[get,outE]
[has,label,eq,knows]
[get,inV]
[get,age]
[mean]``````

The above bytecode is oriented towards a property graph data structure. In order to leverage the graph representation embedded in the relational structure, the previously defined embedding enables the following translation.

``````[db][get,V]
[has,id,eq,1]                        (1)
[at,[[db][get,E]],
[[has,outV,eq,![get,id]]         (2) (3)
[has,label,eq,knows]            (4)
[at,[[db][get,V]],
[[has,id,eq,![get,inV]]     (5) (6)
[get,age]                  (7)
[mean]]]]]``````
 1 This references an index lookup on table `V` (based on id). 2 This aborts the previous index lookup on table `V` (id is known). 3 This references an index lookup on table `E` (based on outV). 4 This references a composite index lookup on table `E` (based on outV and label). 5 This triggers the previous composite index lookup on table `E` (inV is unknown). 6 This references an index lookup on table `V` (based on id). 7 This triggers the previous index lookup on table `V` (age is unknown).
 The `[at]` instruction implements a nested loop join. For each incoming record, the `[at]`-stream is scanned and processed by the subsequent `[has]`-filters (i.e. the join predicates). In the above bytecode example, where indices are defined, serial scans are avoided and an index nested loop join is enacted.

The relational bytecode can be used by a processor to query the storage system or, if the storage system supports SQL, the bytecode can be translated to the SQL query below and submitted as such. Both approaches allow relational database users to query their data with a graph traversal language.

``````SELECT AVG(age) FROM V
WHERE id=(SELECT inV FROM E WHERE outV=1 AND label='knows')``````

### Structures

A structural type system (or property-based type system) is a major class of type system, in which type compatibility and equivalence are determined by the type’s actual structure or definition, and not by other characteristics such as its name or place of declaration.

— Wikipedia Article on Structural Type System

Every denotable entity in mm-ADT is an object. Every object is a pattern capable of being used in pattern matching. Objects that match other objects are called types. Objects that match themselves are called instances. Instances lacking their full value, but with enough data to generate themselves (e.g. a primary key) are called references. The meta-structural aspects of types, instances, and references are specified in the table below for an example "person" `record`.

Table 1. The meta-structural components of an mm-ADT object
kind symbol (?) type value data access path

type

`person`

`type`

`[name:@string,age:@int]`

`[[db][has,name,eq,$0]]` instance `person` `[name:marko,age:29]` `[[db][has,name,eq,marko]]` reference `person` `[name:marko,age:@int]` `[[db][has,name,eq,marko]]` An object can have a denoting symbol. The symbol must be respected by all processes. For this reason, typically only types have symbols as propagating symbols in a distributed environment is costly. Every object has a type. A type’s type is `type`. Instances and references have types specified by the type’s denoting symbol, or for anonymous types (discussed later), the type object itself. Instance and reference types can change over the course of a computation. Every object has a value which is its internal structure. Finally, every object has a data access path that specifies the location of the canonical representation of that object. If the object is in local memory and not associated with a storage system, then its data access path is itself.  In mm-ADT, types are best conceptualized as degenerate instances and references are best conceptualized as lazy instances. Objects are created using the built-in, fundamental mm-ADT data types diagrammed below, where the directed arrows denote a subtype of-relationship. Every built-in type is structureless in that is has no meta-structural value (see previous table). All custom types have structural definitions captured by their value. The built-in type hierarchy is the only explicit type hierarchy in mm-ADT. All custom types exist in an implicit, structural type graph, where instances are realized as different types depending on the model-ADT context. The table below presents the tokens of the mm-ADT-bc language used to write mm-ADT bytecode. Table 2. The tokens of the mm-ADT-bc language token example description `@` `@string` An instance of type `string` (short for `typeof(string)`) `!` `![put,age,29]` Insert the field `age:29` (execute bytecode at runtime) `!!` `!![op,add,1]` Add 1 to the incoming number (execute bytecode at compile time) `x(y)` `lt(40)` A `number` less than 40 (execute predicate operator) `&` `@int&gt(32)` An `integer` greater than 32 (short for `and(x,y)`) `|` `'marko'|'vadas'` Either a 'marko' or 'vadas' `string` (short for `or(x,y)`) `{ }` `@float{2}` Two 32-bit floating point numbers (specific quantifier) `{ , }` `@string{2,5}` Two to five `strings` (range quantifier) `{, }` `@int{,2}` Up to two `integers` (an upper bound quantifier) `{ ,}` `@int{2,}` Two or more `integers` (a lower bound quantifier) `*` `@string*` Zero or more `strings` (short for `{0,}`) `?` `@int?` Zero or one `integers` (short for `{0,1}`) `+` `@string+` One or more `strings` (short for `{1,}`) `( )` `(@string|@int)*` Zero or more `strings` or `integers` (denotes pattern grouping) `[ ]` `[@string*]` A `list` of zero or more `strings` (denotes list) `[:]` `[(@string:@int){2}]` A `record` of two `string`/`int` pairs (denotes record) `< >` `<1.23,0.2>` A `stream` of 1.23 and 0.2 (denotes stream) `~` `~x` Store the bound value to the variable `x` `~` Store the bound value based on relative position `$`

`$x` The value bound to the variable `x` `$2`

The 3rd value bound by position

`$_2` The 3rd value bound by position in the previous pattern `$_`

The root object of the pattern (analogous to `this`)

`->`

`-> [hasKey,@string]`

An `instruction` with a `string` argument (denotes instruction)

`=>`

`=> 42.0d`

A mapping to 42.0 `double` (denotes a mapping)

`//`

`// a comment`

A comment lasting until the end of the line

`/**/`

`/* a big comment */`

A multi-line comment block

A Short Introduction to Bytecode

mm-ADT bytecode is discussed at length in the processes section of this specification. For the concepts presented in this section, it is necessary to have at least a rudimentary understanding of the way in which computations are expressed and executed in mm-ADT.

mm-ADT processing is founded on stream computing principles. A stream is an ever expanding and contracting sequence of objects. Streams are processed by functions. A stream function has an incoming stream and an outgoing stream. A stream function maps objects from its input stream to objects in its output stream. If functions are denoted by vertices and streams by edges, then an mm-ADT computation is structured as a directed (semi)-acyclic graph. Such process graphs are defined using mm-ADT bytecode. mm-ADT bytecode has a textual representation called mm-ADT-bc, where the process graph-structure is encoded using S-expressions.

The mm-ADT-bc source below defines a stream with three instructions (functions). The `[branch]` instruction has two bytecode arguments (nested sub-streams) each containing one instruction. A stream is denoted `< >` and the state of the stream after each instruction is specified in the associated comment.

``````[[start,1,2,3],    // <1,2,3>
[is,gt,1],        // <2,3>
[branch,
[fold],         // <[2,3]>
]]                 // <[2,3],9,10>``````

mm-ADT-bc parsers ignore whitespace and commas between instructions. Thus, the above bytecode can also be written in a linear form.

``[[start,1,2,3][is,gt,1][branch,[fold],[op,add,7]]]``

#### Primitives

Primitives form the atomic datum of mm-ADT. A primitive object’s type is known by name only. The table below describes each non-operator primitive type. The regular expression of each type defines a finite state machine capable of recognizing a primitive object written in mm-ADT-bc.

type regular expression example description

`bool`

`true|false`

`true`

A true or false boolean

`bytes`

An 8-bit byte sequence

`int`

`[0-9]+(I|i)?`

`6i`

A 32-bit signed integer

`long`

`[0-9]+(L|l)`

`42L`

A 64-bit signed integer

`float`

`(\.[0-9])?(F|f)?`

`92.65F`

A 32-bit floating point

`double`

`(\.[0-9])?(D|d)`

`101.5d`

A 64-bit floating point

`string`

`'stephen'`

A character sequence

 In mm-ADT-bc, the types `string`, `int`, and `float` have an optional abbreviated syntax. Every unspaced character sequence not in the compiler’s symbol table is typed as a `string`. Any character sequence composed of only numeric characters is typed as an `int`. Finally, any character sequence composed of numeric characters and a single period character is typed as a `float`.
##### Operators

Operators denote simple functions associated with primitive object manipulation. The mm-ADT operators are organized into three categories: predicate, arithmetic, and string.

type constants description

`pred`

`typeof`

Binary predicate functions from `objects` to `bool`

`rel`

```eq neq gt lt gte lte within between```

Relational functions from `objects` to `bool`

`arith`

`add sub mult div pow mod`

Binary arithmetic functions from `numbers` to `number`

`str`

`len split regex`

n-ary `string` functions

#### Sequences

A sequence is a linear chain of objects. Sequences can be structured or unstructured. There are two built-in structured sequences: `lists` and `records`. Structured sequences provide random access to their elements. An unstructured sequence is called a `stream`. The elements of an unstructured sequence are sequentially accessed via iteration. Like the fundamental primitives, the built-in sequences have nominal types while their custom derivations are structurally typed.

type value description

`list`

`[@object*]`

An ordered sequence of objects accessed by position

`record`

`[(@object:@object)*]`

An unordered sequence of pairs accessed by key

`stream`

`@object*`

An unordered sequence of objects accessed by iteration

##### Lists

A `list` is a sequence indexed by an `integer` (`int` or `long`). A `list` can contain heterogeneously typed objects. An example `list` is provided below, where `'josh'` could have been written as `josh` given mm-ADT-bc’s abbreviated syntax for strings.

``[42,'josh',42,true]``

The mm-ADT bytecode instructions for `list`-manipulation are provided in the table below and followed by example bytecode using each instruction.

instruction domain range description

`[get,@integer]`

`list`

`object`

Retrieve the `object` at the specified index

`[put,@integer,@object]`

`list`

`list`

Replace the `object` at the specified index

`[fit,@integer,@object]`

`list`

`list`

Insert the `object` at the specified index

`[fit,@object]`

`list`

`list`

Append the `object` to the end of the list

`[drop,@integer]`

`list`

`list`

Remove the `object` at the specified index

``````[start,[42,josh,42,true]]   // [42,josh,42,true]
[obj,[![get,0],![get,3]]]   // [42,true] (a new list created)
[put,1,42]                  // [42,42]
[fit,1,josh]                // [42,josh,42]
[fit,1,josh]                // [42,josh,josh,42]
[fit,true]                  // [42,josh,josh,42,true]
[drop,1]                    // [42,josh,42,true]``````
 Initial instructions such as `[start]` have no incoming stream. When a computation is initiated, initial instructions emit objects to their outgoing stream. All streams prior to an initial instruction are evaluated, but their output streams are ignored. Thus, the only useful prior streams are those that yield side-effects. All side-effect instructions such as `[put]`, `[fit]`, and `[drop]` take an object from the incoming stream, mutate it, and then place it on the outgoing stream.
##### Records

A `record` is an unordered sequence of key/value pairs called fields. The keys of a `record` are unique. The example `record` below has two fields. Each field has a `string` key (name and age) with respective `string` and `int` values (marko and 29).

``[name:marko,age:29]``

The mm-ADT bytecode instructions for `record`-manipulation are provided in the table below and then followed by an mm-ADT-bc example demonstrating their use.

instruction domain range description

`[get,@object]`

`record`

`object`

Retrieve the value at the specified key

`[put,@object,@object]`

`record`

`record`

Insert the specified key/value (replacement possible)

`[drop,@object]`

`record`

`record`

Remove the field with the specified key

``````[start,[name:marko,age:29]]           // [name:marko,age:29]
[obj,[name:![get,name],
age: ![[get,age][op,add,1]]]    // [name:marko,age:30] (a new record created)
[put,age,29]                          // [name:marko,age:29]
[put,wallet,[5,10,10]]                // [name:marko,age:29,wallet:[5,10,10]]
[drop,wallet]                         // [name:marko,age:29]``````
 The `!` token states that the associated `list` should be interpreted as bytecode. If the `[obj]` instruction, in the example above, did not contain `!` tokens, then the output object would have been `[name:[get,name],age:[[get,age][op,add,1]]]`. Whenever there is ambiguity as to whether the argument should be treated as a `list` or as bytecode, mm-ADT-bc will assume `list`. Therefore, if the argument should be treated as bytecode, the `list` should be prefixed with `!`.
##### Streams

A `stream` is an unbounded, structureless sequence of objects. Streams are fundamental to mm-ADT processing. Bytecode instructions operate on streams of data. Thus, every instruction is ultimately a stream instruction. Examples are presented below.

``````[start,42]                      // <42>
[op,sub,13]                     // <29>
[obj,[name:marko,age:![_]]]     // <[name:marko,age:29]>
[branch,[get,name],[get,age]]   // <marko,29>
[is,typeof,string]              // <marko>
[op,len]                        // <5>
[is,gt,6]                       // < >``````

Streams are lazy evaluation objects. They do not contain objects, but provide access to an object sequence. Thus, they act as references as opposed to instances. The significance of this distinction will be made apparent when discussing the role of references in bytecode evaluation (i.e. query processing). A collection of stream behaviors are enumerated below.

1. `< >` is used to denote a stream.

• `<1,2,3>` is a stream of the integers 1, 2, and 3.

2. Every `object` is implicitly in a stream.

• `<stephen>` is processed equivalently as `stephen`.

3. Streams can not be nested.

• `<43,<21,12>,3>` is equivalent to `<43,21,12,3>`.

4. Streams are not generally matched by `@object`.

• `<marko>` is matched by `@object`.

• `<marko,kuppitz,42>` is matched by `@object{3}`.

A stream can have objects added to it and objects removed from it.

instruction domain range description

`[put,@object]`

`stream`

`stream`

Insert the object into the stream reference.

`[drop]`

`object`

Remove the instance at its data access path (assuming in stream).

#### Custom Types

An mm-ADT type is an object that describes other objects. The diagram on the left details the syntactical components of an mm-ADT-bc type. Every type definition starts with a type value describing the structure of an instance using bound and unbound values. The component to the right of the mapsto token (`=>`) specifies the data access path used to refer to the canonical representation of an instance of that type (typically in a storage system). Types can have zero or more instructions. This is analogous to object oriented programming, where a class can have defined methods. However, unlike object oriented programming, the instructions are always from the predefined mm-ADT instruction set and constrained to instructions of the built-in super-type. Type instructions specify rewrite rules for how to achieve the matching instruction’s semantics given the respective model-ADT’s embedding. Type instructions are prefixed with a step token (`->`). If the processor’s current instruction matches a type instruction of the object being processed, the result to the right of the mapsto token (`=>`) is merged into the instruction’s outgoing stream. A result can be bytecode (a reference to a stream of objects) or a type (in which case, it is the responsibility of the storage system to manifest a matching instance — typically as a reference).

 The type value is the only component required when defining a type.

There are two sorts of types: named types and anonymous types.

• Named type: any type with a corresponding symbol. The symbol serves as a named-constant.

• Anonymous type: any type that does not have a corresponding symbol.

A named type is created using the `[define]` instruction. The following named `person` type has an anonymous `int` subtype for the age value.

``[define,person,[name:@string,age:@int<between(0,150)>]]``

The `[define]` instruction takes a `db` object as input. The `db` object is a globally recognized object denoting the mm-ADT machine. In distributed environments, every node contains a `db` object. When a type is defined, the definition is synchronized across all `db` objects before the computation proceeds. As such, named type are typically not defined by queries, but when an mm-ADT system is initialized.

##### Custom Primitives

mm-ADT primitives can be extended by creating subtypes. A `probability` type is defined below as a subtype of `double`. Everything that is a `probability` is also a `double`, but everything that is a `double` is not a `probability`.

``[define,probability,double<between(0.0,1.0)>]``
 Primitive object types and instances are created with stream notation, where no more than one object can exist in `< >`. While semantically valid, `< >` is necessary as a delimiter when parsing.

The newly minted `probability` can be used in other pattern definitions. For example, the type `likely` is a subtype of `probability`.

``[define,likely,probability<gte(0.85)>]``

Common database and programming language types can be defined in mm-ADT. The source code below specifies SQL’s `varchar`, `short` integers, and complex numbers with `add` and `mult` operators defined. Note that `complex` is not a primitive type, but a subtype of `list`.

``````[define,varchar255,string<![[op,len][is,lte,255]]>]
[define,short,integer<between(-32767,32767)>]
[define,complex,[@double~x1,@double~x2]
-> [op,add, complex[@double~y1,@double~y2]] => complex[add($x1,$y1),add($x2,$y2)]]
-> [op,mult,complex[@double~y1,@double~y2]] => complex[sub(mult($x1,$y1),mult($x2,$y2)),
add(mult($x1,$y2),mult($x2,$y1))]]``````
##### Custom Sequences

If a model-ADT has a data type that is "sequence-like," but requires particular constraints on what can be contained therein, a custom sequence type can be defined. For example, tuples can be defined as a subtype of `list`.

``````[define,pair,[@object{2}]]
[define,triple,[@object{3}]]
[define,person_tuple,[@string,@int]]``````

A relational database `row` is "record-like" in that the keys are column names and the values are entries. However, unlike an mm-ADT `record`, a relational `row` can only have `string` keys (column names) with values that are either `strings`, `numbers`, or `bools` (row entries). A simple relational data model in first normal form is defined below where `rows` are records with primitive values and `tables` are streams of `rows`.

``````[define,row,[(@string:(@string|@number|@bool))*]]
[define,table,row*]``````

The above relational types can be subtyped to create domain-specific schemas. For instance, a `people`-table may contain `person`-rows. A `person` is a subtype of `row` with only two fields: a `string` name and an `int` age. Similarly, `people` is a subtype of `table` containing only `person` objects.

``````[define,person,row[name:@string,age:@int]]
[define,people,table<@person*>]
[start,people<person[name:marko,age:29],
person[name:josh,age:32],
person[name:peter,age:35]>]   // a people stream containing 4 persons``````

If a sequence is created with a type prefix, then the instance must always match the type pattern (at creation and any subsequent mutations). However, it is important to note that the type is not stored with the instance. When an instance is created, its type is the specified type. When an existing instance is retrieved, its type is determined by the model embedding. For example, if a data location is accessed as a `table`, then `row` objects are streamed. However, if the same data location is accessed as `people`, then `person` objects are streamed.

Type and Instance Derivations

The mm-ADT-bc syntax for creating a type is same as the syntax for creating an instance. The distinction between a type and an instance is whether the object’s value contains unbound values. The first source block below demonstrates the mechanical procedure for deriving a type value from a type definition. Note that the lines following each `[define]` are demonstrating the derivation and, in this context, are not valid mm-ADT-bc expressions.

Type value derivations
``````// a person is a record with a string name and an int age
[define,person,[name:@string,age:@int]]
=> [name:@string,age:@int]
=> [![is,eq,name]:![is,typeof,string],![is,eq,age]:![is,typeof,int]]

// a worker is a person that also has a string company
[define,worker,person[company:@string]]
=> [name:@string,age:@int]&[company:@string]
=> [name:@string,age:@int,company:@string]

// a child is a person whose age is less than 10
[define,child,person[age:lt(10)]]
=> [name:@string,age:@int]&[age:lt(10)]
=> [name:@string,age:@int&lt(10)]

// a slave is a child whose is also a worker
[define,slave,child&worker]
=> [name:@string,age:@int&lt(10)]&[name:@string,age:@int,company:@string]
=> [name:@string&@string,age:@int&lt(10)&@int,company:@string]
=> [name:@string,age:@int&lt(10),company:@string]

// a lowly is a child or a worker
[define,lowly,child|worker]
=> [name:@string,age:@int&lt(10)]|[name:@string,age:@int,company:@string]
=> [name:@string|@string,age:@int&lt(10)|(age:@int,company:@string)]
=> [name:@string,age:@int&lt(10)|(age:@int,company:@string)]``````

The source block below demonstrates the creation of instances. Note that in those situations where unbound values can not be bound, a type is created.

Instance value derivations
``````person[name:marko,age:29]
=> [name:@string,age:@int]&[name:marko,age:29]
=> [name:@string&marko,age:@int&29]
=> [name:marko,age:29]
=> [![is,eq,name]:![is,eq,marko],
![is,eq,age]:![is,eq,29]]                                   // instance

person[name:marko]
=> [name:@string,age:@int]&[name:marko]
=> [name:@string&marko,age:@int]
=> [name:marko,age:@int]                                        // type

worker[name:marko,age:29]
=> [name:@string,age:@int,company:@string]&[name:marko,age:29]
=> [name:@string&marko,age:@int&29,company:@string]
=> [name:marko,age:29,company:@string]                          // type

lowly[name:marko,age:29]
=> [name:@string,age:@int&lt(10)|(age:@int,company:@string)]
&[name:marko,age:29]
=> [name:@string&marko,
age:@int&lt(10)&29
|(age:@int&29,company:@string)]
=> [name:@string&marko,
|(age:29,company:@string)]
=> [name:@string&marko,
(age:29,company:@string)]
=> [name:marko,age:29,company:@string]                          // type

lowly[name:little-timmy,age:4]
=> [name:@string,age:@int&lt(10)|(age:@int,company:@string)]
&[name:little-timmy,age:4]
=> [name:@string&little-timmy,
age:@int&lt(10)&4
|(age:@int&4,company:@string)]
=> [name:@string&little-timmy,
age:4                             // <- binding branch
|(age:4,company:@string)]
=> [name:little-timmy,age:4]                                    // instance``````

#### References

Data access is an expensive operation and should be avoided if possible. Unlike typical virtual machine environments (e.g. the JVM), where data is accessible via the physical machine’s random access memory, storage systems (the shared memory system of an mm-ADT machine) encode their data throughout the memory hierarchy and across the network (see data access costs). As such, object values should only be accessed when absolutely necessary. Objects that don’t contain all their values are called references. References serve as proxies to their referent objects. At minimum, a reference must contain enough information to manifest the referent’s data using the associated data access path. Accessing the referent object is called dereferencing. An example `person` type, instance, and reference are presented below.

``````person[name:@string~,age:@int] => [[db][has,name,eq,$0]] // a person type person[name:marko,age:29] // a person instance person[name:marko] // a person reference``````  All mm-ADT optimizations are framed as techniques for avoiding dereferencing or for leveraging spatial and temporal reference information to ensure efficient dereferencing. ##### Dereferencing A reference must have enough information to manifest its referent. If not, the reference is a type. For example, given the following `person` definition, where no data access path is defined, the final instruction does not create a reference, but a subtype of `person`. That is, it matches all `persons` named marko regardless of their age. ``````[define,person,[name:@string,age:@int]] // a person type [start,person[name:marko,age:29]] // a person instance [name:marko,age:29] [start,person[name:marko]] // a sub-type of person [name:marko,age:@int]`````` If a type has a data access path, then a reference can be created. The reference must be sufficiently populated to ensure that the data access path forms a proper instance (i.e. the path itself does not contain unbound values). ``````[define,person,[name:@string~,age:@int] => [[db][has,name,eq,$0]]]   // a person type
[start,person[name:marko]]                                           // a person reference
[get,age]                                                            // an age reference``````

Since age is not specified in the `person[name:marko]` reference instance, age is accessed via the concatenation of the data access path and `[get,age]`. The evaluation of the data access path is called dereferencing.

``````person[name:marko]                                          // a person reference
person[name:marko,age:![[db][has,name,eq,marko][get,age]]]  // implicit representation``````
 The data access path is the canonical means of accessing the referent via the model-ADT’s primary structure. However, in practice, depending on where the referent is stored (memory, disk, network, etc.) and how the respective mm-ADT object is linked to the storage system, the data access path may not be explicitly used.
##### Optimizations

References have a significant role in mm-ADT optimization and are discussed at length after mm-ADT bytecode is presented. In the meantime, a simple example is provided.

``````[define,person,[name:@string,age:@int]]
[define,people,person*
-> [has,age,@rel~,@int~] => person[age:$0($1)]*]
[put,people => [db]]``````

In the mm-ADT-bc source above, `people` is defined as a (sub)type of `stream` containing zero or more `persons`. The `people` type has an age-based `[has]` instruction pattern. If the processor invokes a matching `[has]` on a `people` instance, an anonymous stream object is returned that is guaranteed to contain zero or more `persons` with the respective age. This instruction definition is requesting that the storage system produce a direct path from stream to sub-stream in order to subvert the materialization of all the `person` instances in `people`. This is how database indices are modeled in mm-ADT.

It is a requirement of the storage system to provide a sublinear time access route to those `persons` of a particular age. If the storage system does not have the requisite secondary structures to support this "jump," then an error should be reported at `[define]`. In which case, a linear "scan and filter" is enacted by the processor. In other words, the processor performs the computation over the primary structure which entails iterating the entire `person` stream.

 The `person[age:$0($1)]*` is a dynamically created anonymous type whose data access path is the source reference data access path concatenated with the yielding instruction. Thus, the data access path is typed as `[[db][has,age,@rel~,@int~]]`. This data access path represents the logical, linear route to the data.

A reference instruction’s result must be equivalent to the result of applying the corresponding instruction to the original reference’s referent. This requirement is made salient in the diagram on the right, where applying `[has,age,lt,30]` to the `people` referent produces the same result as dereferencing the `person[age:lt(30)]*` reference. The two data access paths are diagrammed below.

 In diagrams, when an instance has a label that is of the form `symbol`, it denotes that the instance has the type `symbol`, not that the object is the type `symbol`.

References are related to other references via type instructions and the union of these relations is called the reference graph. The compile- and run-time analysis of mm-ADT reference graphs is the primary means by which model-ADT optimizations are realized. The various roles references serve in optimization are enumerated below.

• Constraints: Types match instances. Reference type instructions can generate constrained subtypes which ultimately provide information to the processor about the nature of the referent and the structures supporting its encoding. This is how complex database schemas such as those containing integrity constrains, indices, sort orders, and the like are modeled.

• Sub-streams: A stream refers to a sequence of objects. A stream that refers to a subset of a stream is a called a sub-stream reference. Database indices are modeled using sub-stream references.

• Proxies: Storage systems provide a shared memory structure that allows references with data access paths to be detached, serialized, and re-attached as needed. In this way, references serve as serializable proxies to objects in distributed database environments.

• Co-Locations: A reference’s value or data access path may contain information about the physical location of the referent. If so, references can migrate to their respective referent’s location and subsequently be processed by a local thread of execution. In this way, references provide data local processing via the co-location of structure and process in distributed database environments.

• Functions: A reference can refer to the result of a computation executed outside the mm-ADT processor. Storage system aggregation operations are modeled using function references.

Processor and Storage Communication

The storage system and processor communicate with one another through mm-ADT objects. This is what it means for these two components to be mm-ADT compliant. When the processor reads from an mm-ADT object, it is reading data stored in the storage system. When the processor writes to an mm-ADT object, it is writing to the storage system’s physical representation of that object.

mm-ADT reference objects serve a more interesting purpose. In a processor, a reference is a unformed instance with potential access to the object value. In the storage system, a reference points to a potential operation such as, for example, a simple value lookup, a full-table scan, or an index query. The act of dereferencing, causes the storage system to enact the respective operation which manifests the mm-ADT object which is now the fully populated referent of the processor’s respective reference.

At the semantic level, there is no difference between a reference and an instance. Both denote objects with bound values. It is possible (and advisable) that the storage system generate references as deemed appropriate according to its knowledge of its internal data structures. For example, if a stream contains only a few objects, then a scan and filter approach is typically more efficient than an index lookup. As such, at runtime, the storage system should provide a reference to the most appropriate computation.

``````[define,people,person*
-> [has,age,@rel~,@int~] => person[age:$0($1)]*
[put,people => [db]]
[has,age,lt,30]               // scan and filter or index lookup?``````

References do not dictate the means by which their data access paths are resolved. Instead, a processor simply moves from the tail of the data access path (the reference) to the head of the data access path (the referent) in "one step." It is this ambiguity around data access path resolution that allows storage systems to integrate auxiliary, secondary structures while remaining loosely coupled to the processor.

#### Database Object

Every storage system has a corresponding `db` object which can be accessed at anytime via the `[db]` instruction. The `[db]` instruction is an initial instruction in that it does not have an input stream. The `db` object that it emits to the output stream is the "root" of the storage system’s data structure. The typical mm-ADT type of the `db` object for common model-ADTs are presented below. While a `db` object can be a `record`, `list`, `stream`, etc. it is unique in that machine instructions can also be applied to it.

relational

`[(@string:@table)*]`

Tables as streams of rows indexed by table name

property graph

`vertex*`

Vertices as a stream

document

`[(@string:document*)*]`

Collections as streams of documents indexed by collection name

key/value

`[@key,@value]*`

Key/value tuples as a stream

rdf

`statement*`

Statements as a stream

wide-column

`[(@string:@table)*]`

Tables as streams of variable width rows indexed by table name

### Processes

Bytecode, also termed portable code or p-code, is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable source code, bytecodes are compact numeric codes, constants, and references (normally numeric addresses) that encode the result of compiler parsing and performing semantic analysis of things like type, scope, and nesting depths of program objects.

— Wikipedia Article on Bytecode

mm-ADT operations are instructions drawn from the mm-ADT instruction set. Instructions are organized into bytecode. Bytecode represents a step-by-step, procedural process that maps a stream of input objects to a stream of output objects. Bytecode is generated by a query language compiler and evaluated by a processing engine. The processing engine’s operations must be Turing Equivalent to the model-ADT’s instruction set (a subset of the mm-ADT instruction set). In general, any stream-based processing engine supporting map, filter, reduce, and branch operations can execute any mm-ADT bytecode query.

#### Instructions

The mm-ADT instruction set is Turing Complete and can be used to generate mm-ADT bytecode that captures the semantics of any query written in any computable query language. To simplify query language compiler development, mm-ADT provides many high-level instructions that encapsulate common database query motifs such as pattern-matching, aggregation, selection/projection, and joins.

An mm-ADT instruction is a list containing a string opcode followed by zero or more object arguments.

inst `[@string,@object*]`

index

type

description

`0`

`string`

The instruction opcode

`1`

`object`

The instruction’s first argument

`2`

`object`

The instruction’s second argument

`3`

`object`

The instruction’s third argument

`…​`

`…​`

…​

`n`

`object`

The instruction’s nth argument

Instruction arguments can be instances known at compile-time or references computed at runtime. This distinction is made salient below, where the two instructions are semantically equivalent, thought the first instruction uses a `string` instance and the second instruction uses an anonymously typed reference whose data access path yields the `string` 'marko' for every incoming object.

``````[has,name,eq,marko]
[has,name,eq,string => [start,marko]]``````

A more practical example is provided below. In the first instruction, the `int` age is set at compile-time and remains static for the full execution of the query. In the second example, the `int` is computed for every incoming object. The latter example demonstrates context-dependent arguments as, for every incoming `person`, the `[has]` instruction will only allow to pass those who are older than their oldest friend.

``````[has,age,gt,25]
[has,age,gt,int => [[get,friends][get,age][max]]]``````

The above reference’s type is an anonymous subtype of `int`. The data access path states that instances are in correspondence with the maximum ages of friends (a subset of `int`). If the reference’s type can be determined from the data access path, then the type object is optional. Thus, the previous two references can be written as below, where `!` states that the `list` should be interpreted as runtime bytecode and not a `list` object.

``````[has,name,eq,![start,marko]]                    // N/A (initial) -> string
[has,age,gt, ![[get,friends][get,age][max]]]    // person        -> int``````

Assume the following bytecode. The `[start]` generates a reference to a `person` named marko. The `[obj]` creates a new record with two fields: name and popularity. The `![get,name]` reference does not require dereferencing the previously created `person` reference as the `person` has a name value. However, the popularity value requires a dereference of the friends stream to ultimately yield the count.

``````[db]
[define,person,[name:@string~,age:@int,friends:@person*] => [[db][has,name,eq,$0]]] [start,person[name:marko]] [obj,[name :![get,name], popularity:![[get,friends][count]]]]``````  The reference `![[get,friends][count]]` appears to have an unbound value — there is no `person` object for `[get,friends]` to consume. However, in fact, there is a bound `person` object. The reference’s type is defined anonymously within the `[obj]` instruction and thus, the `person` object is bound, ultimately, to `person[name:marko]`. If the reference were to be isolated from its context, it would be a type as `[get,friends]` is not an initial instruction and thus, the data access path would be unresolvable. #### Bytecode An instruction denotes a single, atomic operation. Bytecode represents a series of such operations which, in aggregate, compute the results of a query. Structurally, bytecode is a list of instructions. As previously demonstrated, instructions can contain bytecode arguments. Therefore, bytecode is a nested, acyclic tree data structure. Unnested bytecode is called a query. Nested bytecode is called a subquery. Bytecode queries are always written according to the data access paths of the models-ADT’s primary, logical structure. If a particular region of the primary structure has a supporting secondary structure (e.g. an index or a denormalization), then secondary data access paths are made available by the storage system via references. By remaining agnostic to optimizing structures, mm-ADT compliant query language compilers can be used with any mm-ADT compliant processor and storage system. bc `[@inst*] <=> [[@string,@object*]*]` index type description `0` `inst` The first instruction `1` `inst` The second instruction `2` `inst` The third instruction `3` `inst` The forth instruction `…​` `…​` …​ `n` `inst` The (n-1)th instruction The instruction `[has,name,eq,![get,name]]` contains a bytecode argument. This argument is actually a single instruction. The mm-ADT-bc compiler will automatically nest single instructions when the context is known to accept bytecode. Thus, `![get,name]` is translated to `![[get,name]]`. There are three contexts in which bytecode is known to be the accepted argument. context example description `=>` `=> [has,name,eq,39]` Instructions can only map to bytecode or types `@bc` `[branch,[get,name],[get,age]]` Bytecode may be explicitly required in a pattern match `!` `![get,name]` Bytecode computes object at runtime `!!` `!![op,add,1]` Bytecode computes object at compile-time ##### Bytecode Source Bytecode source is used in two situations: modeling and querying. ###### Modeling Every mm-ADT compliant storage system must publish a model-ADT describing its primary, logical structure. The definition is an embedding into mm-ADT types. A relational-ADT is defined below, where the `[model]` instruction’s second argument denotes the embedding’s source and target ADTs. That is, `rdb=>mm` states that a model denoted `rdb` is being defined in terms of `mm` (mm-ADT). The subsequent `[define]` instructions specify `rdb` types and their representation in mm-ADT. The storage-ADT as an mm-ADT embedding ``````// relational-ADT [model,rdb=>mm, [define,row,[@string:@bool|@string|@number|@bytes]], [define,table,row*], [define,db,[(@string:@table)*]]]`````` A domain-specific social-ADT is defined below in terms of the previous relational-ADT. The social-ADT provides a proper, monomorphic sub-model of the relational-ADT. Thus, the embedding is a simple constraint. This is how database schemas are modeled. A schema as a relational-ADT embedding ``````// social-ADT [model,social=>rdb, [define,person,row[name:@string,age:@int,company:@string]], [define,organization,row[name:@string,founded:@int]], [define,people,table<@person*>], [define,organizations,table<@organization*>]]`````` The storage system’s model-ADT may not be a proper super-set of the user’s or query language’s model-ADT. In such situations, a more complex embedding needs to be defined. A key/value-ADT is embedded in the storage system’s relational-ADT below. A key/value-ADT as a relational-ADT embedding ``````// key/value-ADT [model,kv=>rdb, (1) [define,key,@number], (2) [define,value,@string], (3) [define,keyvalue,[@key~,@value~] (4) => (row[key:$0,value:$1] => [[db][get,kv][has,key,eq,$0]])               (5)
-> [put,0,@key]          => [error]                                      (6)
-> [put,1,@value~]       => [put,value,$0] (7) -> [get,0] => [get,key] (8) -> [get,1] => [get,value]], (9) [define,db,keyvalue* => [[db][get,kv]] (10) -> [new] => [[db][put,kv, (table<row[key:@key,value:@value]*> => [[db][get,kv]] (11) -> [has,key,eq,@key~] => row[key:$0]?)]],    (12)
-> [dedup,[get,0]]               => []                                   (13)
[       @inst*~a,[_]] => [$a] // a*1=a`````` ###### Distributivity The right and left distributivity of multiplication over addition can be expressed in mm-ADT bytecode. ``````[branch,[@inst*~a,@inst+~c], [@inst*~b,@inst+~c]] => [[branch,[$a],[$b]][$c]]   // (a*c)+(b*c)=(a+b)*c
[branch,[@inst+~a,@inst*~b],
[@inst+~a,@inst*~c]] => [[$a][branch,[$b],[$c]]] // (a*b)+(a*c)=a*(b+c)`````` ###### Binomials and Multinomials A binomial is any structure of the form (a+b)^n. A multinomial generalizes a binomial. Bi- and multinomials can be expanded in order to transform a multiplication of additions into an addition of multiplications. The multinomial below is presented with its expanded form. Expansions can be solved using the "foil method." From the perspective of streams, the concatenation of two 2-way branches can be expanded into an equivalent single 4-way branch. (a+b)(b+c) = ab + ac + b^2 + bc ##### The Coefficient Ring The structure <<bb C,\+,\cdot>> is called a coefficient ring. The coefficient ring can be any ring with unity. Rings are prevalent and can be found in the integers and floats with numeric addition and multiplication, complex numbers with complex addition and multiplication, matrices with matrix addition and multiplication, etc. The model-ADTs of all common databases make use of an integer-based coefficient rings. However, mm-ADT is poised to open up new model-ADTs. For example, a quantum model-ADT uses complex matrix coefficients (see Section 5.H). The universal functional commutativity of coefficients states that coefficients can be moved right or left and aggregated with multiplication without altering the semantics of the computation (see Theorem 6 and its respective corollaries). ##### The Stream Ring mm-ADT bytecode is a stream ring. A stream ring is the product of the function ring and a chosen coefficient ring and is denoted <<bb C bb F, \+, \cdot >>, where bb C bb F = bb C \times bb F = \{(c,a): c \in bb C \wedge a \in bb F\}. In every tuple (c,a) \in bb C bb F, the first element is a coefficient and the second element is a function. The stream additive operator is defined as ca + db = {((c+d)a,"if " a=b),(ca+db,"otherwise."):} The stream multiplicative operator is defined as ca \cdot db = (c \cdot d)(a \cdot b), where the first \cdot is the coefficient ring’s multiplicative operator and the second \cdot is the function ring’s multiplicative concatenation operator. ###### Stream Ring Axioms Stream ring theory defines seven axioms (see Section 3.C). {: (x ~ y = <<cx>> = <<cy>>,""), (<<cx,dy>> = <<dy,cx>>,""), (<<cx,dx>> = <<(c+d)x>>, "(bulk axiom)"), (<<cx>>da = <<(c \cdot d)a(x)>>,"(apply axiom)"), (<<cx>>(da + eb) = (<<cx>>da) + (<<cx>>eb), "(split axiom)"), (<<cx>> + <<dy>> = <<cx,dy>>, "(merge axiom)"), (<<0x>> = <<c\emptyset>> = << >>,"") :} Equality If two objects `x` and `y` are equivalent, then they can replace each other in a stream. Equality is defined as having the same behavior across all the model-ADT’s instructions. Order Streams are unordered and thus, two streams with the same two objects are equal streams. Bulking Two equal objects in the same stream can be combined into a single object, where the coefficients of the individual objects are summed together according to the coefficient ring’s additive operator. Apply When an object is applied to an instruction, the resulting object’s coefficient is equal to the object’s coefficient multiplied by the instruction’s coefficient as defined by the coefficient ring’s multiplicative operator. Split/Merge When an object encounters a branch in the stream, it is copied to each branch. When two streams are merged, the objects of each stream are put into a single stream. Zero If an object has a zero coefficient (as defined by the coefficient ring), it can be removed from the stream. If an object is an empty stream with a coefficient, it can be removed from the stream. Streams are atemporal. There are no requirements to the order in which functions are applied, streams are merged, and objects are bulked (see Theorem 5). The consequence of this theorem is that process graphs can be evaluated in a lazy manner with depth-first semantics (save space), breadth-first semantics (save time), or any arbitrary hybrid of the two. Different query space/time requirements will demand different execution semantics, where real-time queries touching a small subset of the storage system’s data will typically use depth-first semantics while batch-time queries touching large subsets of the storage system’s data will typically use breadth-first semantics. The mm-ADT Address Spaces An mm-ADT database virtual machines operates in two address spaces: the storage system’s and the processing engine’s address spaces. Storage System The storage system provides a shared memory structure that all mm-ADT machine’s can access and mutate according to the storage system’s transaction semantics. Storage system addresses are mapped to mm-ADT data access paths. When an mm-ADT processor dereferences a data access path, it is up to the storage system to create an mm-ADT object that is populated by and linked to data that may be on another physical machine, serialized on disk, or directly accessible in memory. Processing Engine The processing engine supports two memory structures: a shared memory structure accessible via `[db]` and many shared-nothing memory structures called object threads. The `db` object contains type definitions and other machine metadata. It is primarily a read-only structure as writes require global synchronization. An object thread contains computational metadata such as the current stream object, previous thread states (i.e. path history), a loop stack, etc. Object threads represent isolated threads of execution flowing through the processor’s streams. #### Instruction Set The mm-ADT instruction set is a set of instruction types. For example, the instruction `[get,@object]` is a type as it contains unbound values. This instruction does not exist in the relational-ADT. Instead, the subtype `[get,@string]` is supported by the relational-ADT. When bytecode is generated by a query language compiler, instances of the supported types are created as in `[get,name]`.  The mm-ADT instruction set is a set of instruction types. Instances of the model-ADT’s supported types are used to create bytecode to process the model-ADT’s structure. There are 10 kinds of instructions. • Flatmap Instructions: transforms the input object to zero or more output objects. (one-to-many) • Map Instructions: transform the input object to one output object. (one-to-one) • Filter Instructions: transforms the input object to no object or the input object. (one-to-one or none) • Barrier Instructions: transforms zero or more input objects to zero or more output objects. (many-to-many) • Reduce Instructions: transforms zero or more input objects to a single output object. (many-to-one) • Initial Instructions: generate objects on the output stream. (none-to-many) • Terminal Instructions: consume objects on the input stream. (many-to-none) • Side-Effect Instructions: create, update, or delete an object. • Machine Instructions: interacts with the `db` object and other features of the mm-ADT machine. • Branch Instructions: clones the input object to zero more forks/splits/branches in the bytecode program structure. The map, flatmap, and filter instructions form a function ring. The barrier and reduce functions form a near-ring (see Theorems 16-19). Finally, branch instructions are "structural" in that they are used to shape the process graph and, in of themselves, are not stream functions, but instead, they represent the additive stream operator. ##### Flatmap Instructions opcode description domain arguments range input instruction output `flatmap` `Put the incoming object and the internal stream and drain the internal stream to the output stream.` `@object` `@bc` `@object+` `[marko,29]` `[flatmap,[unfold]]` `marko 29` `unfold` `Stream the incoming objects internal objects` `@object` `@object+` `[marko,29]` `[unfold]` `marko 29` `[marko,[29,42]]` `[unfold]` `marko [29,42]` `42` `[unfold]` `42` `[name:marko,age:29]` `[unfold]` `[name,marko] [age,29]` ###### Map Instructions The map ring <<bb F_m,\+,\cdot>> contains the set of all map instructions. Map instructions transform one input object to one output object. If the map function a: X \rightarrow Y is injective, then for every x \in X, if a(x_1) = a(x_2), then x_1 = x_2. If the map function a is surjective, then for all x \in X every y \in Y is mapped to. Finally, if a is both injective and surjective, then a is bijective and there exists a multiplicative inverse a^(-1) \in bb F_m such that a \cdot a^(-1) = a^(-1) \cdot a = 1. Thus, the map instruction a defines an isomorphism between the instances of the types X and Y and the following axioms and theorems for multiplicative inverses can be used in compilation. {: (a \cdot b = a \cdot c \implies b = c), (b \cdot a = c \cdot a \implies b = c), (a \cdot b = 1 \implies a = b^(-1) \wedge b = a^(-1)), ((a \cdot b)^(-1) = b^(-1) \cdot a^(-1)), ((a^(-1))^(-1) = a) :} opcode description domain arguments range input instruction output `map` `Put the incoming object onto the internal stream and emit the first output to the output stream.` `@object` `@bc` `@object` `[name:marko,age:29]` `[map,[[unfold][count]]]` `2` `[name:marko,age:29]` `[map,[count]]` `1` `[3,2,1]` `[map,[unfold]]` `3` `get` `Get the object indexed by the specified key. For lists, the keys are integers. For records, the keys are field keys.` `@list|@record` `@object*` `@object*` `[name:marko,age:29]` `[get,name]` `marko` `[name:marko,age:29]` `[get,name,age]` `marko 29` `[42,6,92]` `[get,0]` `42` `[creator:[name:marko,age:29]]` `[get,creator]` `[name:marko,age:29]` `loops` `The number of times the object thread has repeated the respective loop section` `@object` `@string?` `@int` `stephen` `[loops]` `3` `kuppitz` `[loops,i]` `4` `obj` `Emit the argument object for each incoming object` `@object` `@object` `@object` `stephen` `[obj,2]` `2` `[1,2,3]` `[obj,[![get,0],42]]` `[1,42]` `path` `The path object of the object thread. Bytecode is evaluated on the path objects in a round-robin manner.` `@object` `[@string*]?,[@bc*]?` `@path` `marko` `[path]` `[[db][name:marko][marko]]` `marko` `[path,[b]]` `[b:[name:marko]]` `marko` `[path,[b,c]]` `[b:[name:marko],c:marko]` `marko` `[path,[b,c],[[type]]]` `[b:person,c:string]` `subrec` `Select a subset of the fields in the record and process the values in a round-robin manner` `@record` `[@object*],[@bc*]?` `@record` `[a:[1,1],b:2,c:3]` `[subrec,[a,b],[[size]]]` `[a:2,b:1]` `[a:[1,1],b:2,c:3]` ```[subrec,[a,b], [[size],[type]]]``` `[a:2,b:int]` `type` `Get the declared type of the object` `@object` `@type` `42` `[type]` `int` `[42,6,92]` `[type]` `list` `wallet[1,5,20,20]` `[type]` `wallet` `[name:marko,age:29]` `[type]` `record` `person` `[type]` `type` ###### Filter Instructions The filter ring <<bb F_f,\+,\cdot>> contains the set of all filter instructions. A filter instruction is abstractly defined by a predicate p : X \rightarrow \{bb(true),bb(false)\}, such that if a \in bb F_f and x is a stream object, then a(x) = { (x,"if " p(x)=bb(true)),(\emptyset,"otherwise.") :} The filter ring is unique in that its multiplicative monoid <<bb F_f,\cdot>> is both idempotent and commutative. The idempotent property means that if any filter instructions repeat within a sequence of filters, all but one can be removed without effecting the semantics of the computation. Moreover, the commutative property guarantees that a sequence of filter instructions can be arranged in any order. For instance, if a,b \in bb F_f, then {: (a \cdot a,=, a,"(idempotent)"), (a \cdot b,=,b \cdot a,"(commutative)".) :} The commutative property allows filter instructions to be sorted by some features such as their cost (computational complexity) and/or their selectivity. In theory, an inexpensive, highly selective filter should be the first filter applied in a sequence. A bytecode optimization strategy based on filter ordering makes use of the filter ring’s total preorder (see Section 5.B). opcode description domain arguments range input instruction output `filter` `Filter the incoming object is the argument has no object` `@object` `@bc` `@object?` `[42,6]` `[filter,[[unfold][sum][is,eq,48]]]` `[42,6]` `[42,6]` `[filter,[[unfold][sum][is,eq,49]]]` `[name:josh]` `[filter,[#]]` `[name:josh]` `[filter,[_]]` `[name:josh]` `#` `Always filter the incoming object` `@object` `@object` `[name:marko,age:29]` `[#]` `42` `[#]` `[1,2,marko]` `[#]` `_` `Never filter the incoming object` `@object` `@object` `[name:marko,age:29]` `[_]` `[name:marko,age:29]` `42` `[_]` `42` `[1,2,marko]` `[_]` `[1,2,marko]` `and` `Filter the incoming object if any of the arguments have no object` `@object` `@bc{2,}` `@object` `42` `[and,[is,lt,43],[is,gt,5]]` `42` `[kuppitz,marko]` `[and,[_],[[unfold][count][is,eq,2]]]` `[kuppitz,marko]` `[kuppitz,marko]` `[and,[#],[[unfold][count][is,eq,2]]]` `coin` `Filter the incoming object based on a biased coin toss` `@object` `@floating` `@object?` `2 7 marko` `[coin,0.5]` `2 marko` `2 7 marko` `[coin,0.0]` `2 7 marko` `[coin,1.0]` `2 7 marko` `has` `Filter the incoming container if it its index value doesn’t satisfy the predicate` `@list|@record` `@object,@pred,@object` `(@list|@record)?` `[name:marko,age:29]` `[has,name,regex,m.*]` `[name:marko,age:29]` `[name:josh]` `[has,age,gt,25]` `[1,2,marko]` `[has,2,eq,marko]` `[1,2,marko]` `hasKey` `Filter the incoming container if it doesn’t have a key that satisfies the predicate` `@list|@record` `@object,@pred,@object` `(@list|@record)?` `[name:marko,age:29]` `[hasKey,regex,na.*]` `[name:marko,age:29]` `[name:josh]` `[hasKey,eq,age]` `[1,2,marko]` `[hasKey,gt,1]` `[1,2,marko]` `hasValue` `Filter the incoming container if it doesn’t have a value that satisfies the predicate` `@list|@record` `@object,@pred,@object` `(@list|@record)?` `[name:marko,age:29]` `[hasValue,regex,mar.*]` `[name:marko,age:29]` `[name:josh]` `[hasValue,eq,42]` `[1,2,marko]` `[hasValue,gt,1]` `[1,2,marko]` `is` `Filter the incoming object if it doesn’t satisfy the predicate` `@object` `@pred,@object` `@object?` `true` `[is,eq,true]` `true` `[name:marko]` `[is,eq,[name:marko]]` `[name:marko]` `[kuppitz,gremlin]` `[is,typeof,float]` `[kuppitz,gremlin]` `[is,typeof,[object{2}]]` `[kuppitz,gremlin]` ```[name:kuppitz, stuff: [1,[2,3],no]]``` ```[is,typeof, [name:@string,stuff:[@int,[@int{2}],yes|no]]``` ```[name:kuppitz, stuff: [1,[2,3],no]]``` `or` `Filter the incoming object if all of the arguments have no object` `@object` `@bc{2,}` `@object` `42` `[or,[is,lt,43],[is,gt,5]]` `42` `[kuppitz,marko]` `[or,[_],[[unfold][count][is,eq,3]]]` `[kuppitz,marko]` `[kuppitz,marko]` `[or,[#],[[unfold][count][is,eq,2]]]` `[kuppitz,marko]` ##### Barrier Instructions opcode description domain arguments range input instruction output `barrier` `Aggregate the entire input stream, process it, and yield an output stream` `@object*` `@bc?` `@object*` `42 6 92` `[barrier]` `42 6 92` ```[name:marko] [name:josh] [name:stephen]``` `[barrier,[get,name]]` `marko josh stephen` `dedup` `Deduplicate the input stream based on some feature of the objects` `@object*` `@bc*` `@object*` `42 6 6 92` `[dedup]` `42 6 92` ```[name:marko] [name:marko] [name:marko,age:29] [name:josh,age:29]``` `[dedup]` ```[name:marko] [name:marko,age:29] [name:josh,age:29]``` ```[name:marko] [name:marko] [name:marko,age:29] [name:josh,age:29]``` `[dedup,[get,name]]` ```[name:marko] [name:josh,age:29]``` `[42,a,1][42,a,2][6,a,2]` `[dedup,[get,0],[get,1]]` `[42,a,1][6,a,2]` `[42,a,1][42,a,2][6,a,2]` `[dedup,[get,1],[get,2]]` `[42,a,1][42,a,2]` `order` `Order the input stream based on some feature of the objects and relational order. Ascending is gt and descending is lt.` `@object*` `[lt|gt,@bc?]+` `@object*` `42 6 6 92` `[order,[lt]]` `92 42 6 6` ```[name:vadas,age:27] [name:josh,age:32] [name:marko,age:29]``` `[order,[gt,[get,name]]]` ```[name:vadas,age:27] [name:marko,age:29] [name:josh,age:32]``` ```[name:vadas,age:29] [name:josh,age:32] [name:marko,age:29]``` ```[order, [lt,[get,age]], [gt,[get,name]]]``` ```[name:josh,age:32] [name:marko,age:29] [name:vadas,age:29]``` `range` `Filter out all objects in the input stream not within the numeric range` `@object*` `@integer{2}` `@object*` `42 6 92` `[range,0,2]` `42 6` `42 6 92` `[range,2,3]` `92` `42 6 92` `[range,2,100]` `92` `sample` `Filter a specified number of input objects biased by some feature of the objects` `@object*` `@integer,@floating?` `@object*` `tail` `Filter those objects not at the end of the input stream` `@object*` `@integer` `@object*` ###### Reduce Instructions The reduce near-ring <<bb F_r,\+,\cdot>> contains the set of all reduce instructions. Multiplication is not right distributive over addition. However, multiplication is left distributive over addition. Reduce instructions are temporal in that all previous instructions must have produced all their outgoing objects before a reduce instruction can process its input stream. Reduce instruction introduce synchronization barriers into the stream. There exists a subset of reduce functions called the monoidic reduce functions. These functions are idempotent and semi-right distributive (see Section 5.D). opcode description domain arguments range input instruction output `reduce` `Reduce the input stream according to the specified object` `@object*` `@object` `@object` `count` `Count the number of objects in the input stream` `@object*` `@long` `42 6 92` `[count]` `3` `[42,6,92]` `[count]` `1` `[count]` `0` `fold` `Put the mutating argument object onto the nested bytecode’s input stream for each incoming object` `@object*` `@object,@bc` `@object` `1 2 3` `[fold,0,[op,add,![_]]]` `6` `marko 42 kuppitz` `[fold,[:],[put,![_],![_]]]` `[marko:marko,42:42,kuppitz:kuppitz]` `marko 42 kuppitz` `[fold,[],[fit,![_]]]` `[marko,42,kuppitz]` `group` `Group the input stream objects by a key and then reduce the values` `@object*` `@bc{3}` `@record` `[42,1][42,5][6,2]` ```[group, [get,0], [get,1], [sum]]``` `[42:6,6:2]` `[42,1][42,5][6,2]` ```[group, [get,0], [obj,1], [sum]]``` `[42:2,6:1]` `42 42 6` ```[group, [_], [obj,1], [sum]]``` `[42:2,6:1]` `marko kuppitz josh` ```[group, [obj,name], [_], [fold,[]]]``` `[name:[marko,kuppitz,josh]]` `max` `Reduce the number input stream to the largest number` `@number*` `@number` `42 6 92` `[max]` `92` `0.1 55 109.2` `[max]` `109.2` `mean` `Reduce the number input stream to the mean` `@number*` `@number` `42 6 92` `[mean]` `46` `0.1 55 109.2` `[mean]` `91.1` `median` `Reduce the number input stream to the median` `@number*` `@number` `42 6 92` `[median]` `42` `0.1 55 109.2` `[median]` `55` `min` `Reduce the number input stream to the smallest number` `@number*` `@number` `42 6 92` `[min]` `6` `0.1 55 109.2` `[min]` `0.1` `sum` `Reduce the number input stream to the sum` `@number*` `@number` `42 6 92` `[min]` `140` `0.1 55 109.2` `[min]` `164.3` ###### Initial Instructions opcode description domain arguments range input instruction output `db` `Emits the current model-ADT’s storage system` `@object` `[db]` `d[persons:people=>[[db][get,persons]]]b` `start` `Emits the argument objects` `@object*` `@object*` `[start,1,2,3]` `1 2 3` `[start,[1,2,3]]` `[1,2,3]` ###### Terminal Instructions opcode description domain arguments range input instruction output `drop` `Drop the incoming object at its data access path` `@object` `[name:marko]` `[drop]` ##### Side-Effect Instructions opcode description domain arguments range input instruction output `sideeffect` `Evaluate the bytecode argument and emit the incoming object` `@object` `@bc` `@object` `[marko,stephen]` `[sideeffect,[[db][drop]]]` `[marko,stephen]` `as` `Label the current object in object thread’s path history` `@object` `@string` `@object` `kuppitz` `[as,a]` `kuppitz` `drop` `Remove the object from the structured sequence by its associated key` `@list|@record` `@object` `@list|@record` `[marko,42,29]` `[drop,1]` `[marko,29]` `[name:marko,age:29]` `[drop,name]` `[age:29]` `fit` `Insert the object into the list at the specified index location` `@list` `@int?,@object` `@list` `[marko,29]` `[fit,1,42]` `[marko,42,29]` `[marko,29]` `[fit,42]` `[marko,29,42]` `put` `Add or replace the object with the provided key/value` `@list|@record` `@object{2}` `@list|@record` `[marko,29]` `[put,0,kuppitz]` `[kuppitz,29]` `[name:marko,age:29]` `[put,name,kuppitz]` `[name:kuppitz,age:29]` `[name:marko]` `[put,age,29]` `[name:marko,age:29]` ###### Machine Instructions opcode description domain arguments range input instruction output `cost` `The cost of executing the instruction on the type, instance, or reference object` `@object` `@inst` `@cost` `migrate` `Migrate the incoming object to the db object at the respective data access path` `@object` `send,recv` `@object?` `vertex[id:1]` `[migrate,send]` `vertex[id:1]` `[migrate,recv]` `vertex[id:1]` `define` `Associate an object (typically a type) with a symbol` `@db` `@string,@object` `@db` `d[]b` `[define,age,int<gt(0)>]` `d[]b` `deref` `Force the dereference of the incoming object` `@object` `@object` `vertex[id:1]` `[deref]` `vertex[id:1,b:2,c:3]` `explain` `Get a description of the bytecode compilation` `@string` `[explain]` `[db][get,V]=>[db][get,1]` `model` `Define an embedding from a source-ADT to a target-ADT` `@db` `@string,[define,@string,@object]+` `@db` `d[]b` `[model,pg⇒mm,[define,db,vertex*]]` `d[]b` `ref` `Create a reference to the incoming object` `@object` `@type?` `@object` `vertex[id:1,b:2,c:3]` `[ref]` `vertex[id:1]` `vertex[id:1,b:2,c:3]` `[ref,vertex[id:@int,b:@int]]` `vertex[id:1,b:2]` `vertex[id:1,b:2,c:3]` `[ref,vertex]` `vertex[id:1]` ##### Branch Instructions opcode description domain arguments range input instruction output `branch` `Copy the incoming stream to each branch stream and then merge to a single outgoing stream` `@object*` `@bc{2,}` `@object*` `1 2` `[branch,[_],[_]]` `1 2 1 2` `1 2` `[branch,[_],[is,gt,1]]` `1 2 2` `at` `Access the object of the first bytecode argument and then pass that object to the second bytecode argument, where all nested bytecode arguments have the incoming object as their source` `@object*` `@bc{2}` `@object*` `[1,[2,[3],4],5]` ```[at,[[get,1][get,1]], [fit,3.5]]``` `[1,[2,[3,3.5],4,5]]` `choose` `Map the incoming object using the first bytecode and propagate it down the bytecode patterns it matches` `@object` `@bc,[@bc,@bc]+` `@object*` `[name:marko,age:29]` ```[choose,[get,age], [[is,lt,30], [obj,young]], [[is,gte,30],[obj,old]]]``` `young` `coalesce` `Propagate the incoming object down the first bytecode argument that has an outgoing object` `@object` `@bc{2,}` `@object*` `ifelse` `If the incoming object satisfies the first argument, then pass to the second, else to the third` `@object*` `@bc{3}` `@object*` `1 2 3` ```[ifelse,[is,gt,1], [op,sub,1], [op,add,1]]``` `2 1 2` `repeat` `The r-bytecode is the loop. The e-bytecode is the emit predicate. The b-bytecode is the break predicate.` `@object*` `[r|e|b,@bc]{1,3}` `@object*` `1` ```[repeat, [r,[op,add,1]], [e,[is,gt,7]], [b,[is,eq,10]]]``` `8 9 10` `1` ```[repeat, [r,[op,add,1]], [b,[is,eq,10]]]``` `10` `1` ```[repeat, [r,[op,add,1]], [b,[[loops][is,eq,9]]]]``` `10` `1` ```[repeat, [e,[_]], [r,[op,add,1]], [b,[is,eq,3]]]``` `1,2,3,3` ### Secondary Structures Query optimization is a function of many database management systems. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans. Generally, the query optimizer cannot be accessed directly by users: once queries are submitted to the database server, and parsed by the parser, they are then passed to the query optimizer where optimization occurs. However, some database engines allow guiding the query optimizer with hints. — Wikipedia Article on Query Optimization A model-ADT’s primary structure is its logical model. Bytecode is always written in terms of the objects that compose the logical model. Thus, from a processor’s perspective, it appears that it is always constrained to the data access paths offered by the linkages between the objects of the primary structure. However, unbeknownst to the processor, certain objects called references behave identically to their referent counterparts in every way save that when certain instructions are applied to them, auxiliary data access paths are exposed that exist outside the primary structure and amongst any number of secondary structures. Secondary structures allow the processor to "teleport" to other regions of the primary structure — exiting and entering as dictated by the references encountered along the way. Secondary structures optimize common, idiomatic data access patterns. Example secondary structures include: schemas, indices, denormalizations, views, functions, and proxies. This section will present the way in which common secondary structures are incorporated in mm-ADT. #### References Revisited A processor traverses a model-ADT’s object graph by executing instructions. When certain instructions are applied to certain objects, a reference object is yielded. In the example above, when `[f]` is applied to the `atype` instance, the processor moves to a `btype*` stream reference. If the processor were to dereference the `btype*` reference it would be directed to the head of the `btype*` stream. However, if the processor does not dereference `btype*` and if the processor’s next instruction is `[g]`, then a new reference is accessed that, when dereferenced, yields a single `ctype` instance. Instead of processing the entire `btype*` stream with `[g]`, the reference graph (the graph of instruction linkages between references) enables the processor to jump to the result of `[g]` avoiding the manifestation of objects in the `btype*` stream. All secondary structures are modeled in this fashion. ##### Primary Structure Paths Assume a relational-ADT storage system with the following table instance. ``````CREATE TABLE persons ( name varchar(255) NOT NULL, age int NOT NULL)`````` The SQL `CREATE TABLE` statement is expressed in mm-ADT-bc as below. ``````[db] [define,person,row[name:@string,age:@int]] // a person value type [define,people,table<@person*>]] // a people value type [put,persons,people => [[db][get,persons]]] // an empty people stream`````` Suppose the persons-table is populated with a significant number of `person` objects (on the order of thousands). ``````[db][get,persons] [put,person[name:a,age:1]] [put,person[name:b,age:2]] ... [put,person[name:z,age:26]]`````` Next, suppose the following query is evaluated by a processor. The table below provides the output for each instruction. ``````[db] // [persons:people => [[db][get,persons]]] [get,persons] // people => [[db][get,persons]] // person[name:a,age:1] ... person[name:z,age:26] (iteration required) [has,age,gt,21] // person[name:v,age:22] ... person[name:z,age:26] [get,name] // v w x y z [dedup] // v w x y z [count] // 5L`````` The instruction `[get,persons]` moves the processor to the head of the `people` stream. When `[has,age,gt,21]` is evaluated, `people` is iterated, manifesting the individual `person` objects of the `people` stream. These `person` objects are individually processed by `[has,age,gt,21]`. Those `persons` not filtered have their name-value accessed by `[get,name]`. The `string` names are then deduplicated via the barrier instruction `[dedup]`. The outputted unique `string` names are then reduced to a final `long` value by `[count]`. The processor’s path through the data required the manifestation of every `person` object and the subsequent analysis of a subset of those objects. The processor’s path was completely contained within the primary structure of the relational-ADT. ##### Secondary Structure Paths Secondary structures can be defined and leveraged during compile- and run-time processing. The previous example is extended using a richer table definition. ``````CREATE TABLE persons ( name varchar(255) NOT NULL UNIQUE, age int NOT NULL) CREATE INDEX persons_age_idx ON persons(age)`````` The table above can also be created in mm-ADT-bc as shown below. There is no way to explicitly create an index with mm-ADT. Instead, an index is created implicitly as the `[has,age,@rel,@int]` instruction mapping tells the storage system it would like a direct reference from a `people` stream to `persons` of a certain age. The storage system can manifest that indirection using any appropriate secondary structure such as an index. Next, `[dedup,[get,name]]` maps to a no-op. This instruction mapping tells the storage system that no two `person` objects have the same name. To guarantee that the `people` instance’s semantics are upheld, the storage system should apply a uniqueness constraint. ``````[define,person,row[name:@string,age:@int]] [define,people,table<@person*> -> [has,age,@rel,@int] => person[age:$0($1)]* -> [dedup,[get,name]] => []] [put,persons,people => [[db][get,persons]]]`````` The output of each instruction from the previous query is provided below. ``````[db] // [persons:people => [[db][get,persons]]] [get,persons] // people => [[db][get,persons]] // -> [has,age,@rel,@int] => person[age:$0($1)]* // -> [dedup,[get,name]] => [] [has,age,gt,21] // person[age:gt(21)]* => [[db][get,persons][has,age,gt,21]] // -> [get,name] => string* // -> [dedup,[get,name]] => [] [get,name] // string* => [[db][get,persons][has,age,gt,21][get,name]] // -> [dedup] => [] // -> [count] => long [dedup] // string* => [[db][get,persons][has,age,gt,21][get,name]] // -> [dedup] => [] // -> [count] => long [count] // long => [[db][get,persons][has,age,gt,21][get,name][count]]`````` The storage system has access to secondary structures and is able to expose a data access path from the `people` stream to those `persons` of a certain age. In other words, the storage system has direct access to "young people" (via the `persons_age_idx` index) and thus, `[has,age,gt,21]` does not require a linear scan of the `people` stream, which would materialize `person` instances. The `[get,name]` instruction requires the names of these young people. The "young people" reference provides an instruction match as the index contains name data (presumably). While the `[dedup]` instruction requires names, the schema states that all names are unique (e.g. via the `UNIQUE` keyword). Thus, `[dedup]` is a no-op and can be safely ignored. Finally, given that names are unique, the number of unique names amongst young people is equivalent to the number of young people. The original `[has,age,gt,21]`-based index query’s count is provided by the storage system via a `long` reference. In this way, the processor only manifests a single `long` value: `5L`. Unlike the previous example, the processor is able to jump from `people` to `5L` — subverting the primary structure’s data access paths by way of references which expose secondary index and schema structures. ##### The Reference Graph The set of all model-ADT references and their relationship to one another via type instructions forms a directed, binary, multi-relational reference graph. The diagram below extends the example from the previous section. • Vertices are references. • Vertex labels denote the reference type. • Edges are instructions. • Edge labels denote the instruction. • The tail vertex is the instruction’s source object. • The head vertex is the instruction’s target object.  Every reference graph commutes. Regardless of the path taken, the destination reference will contain the same referent. The commutativity of reference graphs enables bytecode optimizers to rewrite certain bytecode sequences into semantically equivalent (i.e. yielding the same referents), though computationally more efficient representations. For instructions with bytecode arguments, runtime instruction matching provides processor redirections.  The better secondary structures are at keeping the processor in the reference graph avoiding dereferencing and thus, referent object materialization, the more optimal the model-ADT embedding. #### Common Database Features Database systems share many features in common. These features are captured by mm-ADT in a ADT-agnostic manner. The subsections to follow identify a particular database feature and explain how to represent that feature in mm-ADT. All the examples to follow assume a relational-ADT and SQL. ##### Schemas A database system is designed to effectively encode and process a particular model-ADT. Example model-ADTs include a relational-ADT, a graph-ADT, a document-ADT, etc. If the database supports user-defined schemas, then users are able to create domain-specific model-ADTs that are ultimately a monomorphic embedding into the database’s model-ADT. This section will go through various `CREATE TABLE` examples for creating a domain-specific model-ADT over a relational-ADT in SQL. Each example will discuss the corresponding mm-ADT-bc representation. ###### Record Types The SQL statement below creates a persons-table where every person row has an optional name and age field. This is represented in mm-ADT-bc using the zero-or-one `?` quantifier on each field. ``````CREATE TABLE persons ( name varchar(255), age int, )`````` ``[db][put,persons,[name:@string?,age:@int?]* => [[db][get,persons]]]`` ###### Primary Keys A primary key is a collection-specific, uniquely identifying field (or set of fields). This means that if the collection is processed by a `[dedup]` based on the primary key field, a no-op should occur. ``````CREATE TABLE persons ( name varchar(255), age int, PRIMARY KEY (name) )`````` ``````[db][put,persons,[name:@string,age:@int?]* => [[db][get,persons]] -> [dedup,[get,name]] => []]`````` ###### Sort Orders The rows of a table can be ordered such that upon access, the resultant stream returns its objects in sort order. This is specified in mm-ADT-bc as an `[order]` rewrite to `[]` on the particular key/order defined. ``````CREATE TABLE persons ( name varchar(255), age int, PRIMARY KEY (name,ASC) )`````` ``````[db][put,persons,[name:@string,age:@int?]* => [[db][get,persons]] -> [dedup,[get,name]] => [] -> [order,[gt,[get,name]]] => []]`````` ###### Null Values Unless otherwise stated, a row value can be `null`. If a field should not contain a `null` value, then `NOT NULL` can be specified which amounts to a `{1}` quantifier in mm-ADT-bc. ``````CREATE TABLE persons ( name varchar(255), age int NOT NULL, PRIMARY KEY (name,ASC), )`````` ``````[db][put,persons,[name:@string,age:@int]* => [[db][get,persons]] -> [dedup,[get,name]] => [] -> [order,[gt,[get,name]]] => []]`````` ###### Check Constraints A row value can be constrained to always satisfy a particular predicate. For instance, in the person domain model, ages should never be negative. This is specified in SQL using `CHECK` and in mm-ADT-bc using an anonymous `int` type. ``````CREATE TABLE persons ( name varchar(255), age int NOT NULL, PRIMARY KEY (name,ASC), CHECK (age >= 0) )`````` ``````[db][put,persons,[name:@string,age:int<gte(0)>]* => [[db][get,persons]] -> [dedup,[get,name]] => [] -> [order,[gt,[get,name]]] => []]`````` ###### Foreign Keys A field value can refer to another row in a table. The value must be uniquely identifying such as the referent table’s primary key. Foreign keys provide a way to link "objects" which is expressed in mm-ADT-bc using a references. ``````CREATE TABLE persons ( name varchar(255), age int NOT NULL, manager varchar(255) NOT NULL, PRIMARY KEY (name,ASC), CHECK (age >= 0) FOREIGN KEY (manager) REFERENCES person(name) )`````` ``````[db][put,persons,[name:@string, age:int<gte(0)>, manager:([name:@string~] => [[db][get,persons][has,name,eq,$0]])]*
=> [[db][get,persons]]
-> [dedup,[get,name]]      => []
-> [order,[gt,[get,name]]] => []]``````

When specified in mm-ADT-bc, only those instruction rewrites that can not be deduced from the type object are denoted. An implementation may add instruction rewrites founded on type information. The following type definition includes all such instructions.

``````[db][put,persons,[name:@string,
age:int<gte(0)>,
manager:([name:@string~] => [[db][get,persons][has,name,eq,$0]])]* => [[db][get,persons]] -> [dedup,[get,name]] => [] -> [order,[gt,[get,name]]] => [] -> [hasKey,type,string] => [] -> [hasKey,eq,name|age|manager] => [] -> [has,name,type,string] => [] -> [has,age,type,int<gte(0)>] => [] -> [has,manager,type,[name:@string]] => [] -> [has,age,lte|eq,int<lt(0)>] => [#]]`````` ##### Indices A database index is a secondary structure that provides direct access to subsets of the primary structure. Typically, a database index is able to transform linear `O(n)` "scan and filter" set reductions into sublinear `O(1)` or `O(log n)` set reductions via a hash-index or B-tree index, respectively. In mm-ADT, index-based data access paths are modeled with references.  An index is denoted by a filter-instruction that returns a reference to a subset of the current reference’s referents. ###### Single Key Indices The following SQL statement creates a table with respective index on name. The index is defined as being `UNIQUE` and therefore, no two rows in the table can share the same name value. ``````CREATE TABLE persons ( name varchar(255), age int ) CREATE UNIQUE INDEX persons_name_idx ON persons (name)`````` The above SQL construction is expressed in mm-ADT-bc as follows. ``````[db] [define,person,[name:@string, age:@int]] [define,people,person* -> [has,name,eq,@string~] => person[name:$0]?]
[put,persons,people => [[db][get,persons]]]``````

There is a data access path that subverts the need to dereference `people`. Furthermore, because all `person` names are unique, the resultant `@person[name:$0]?` stream is guaranteed to have 0 or 1 `persons`. This is expressed with the type’s `?`-quantifier. It is through such mechanisms that a storage system exposes its indices. In particular, this is how `persons_name_idx` is exposed. If the storage system uses a relational-ADT, then the following SQL query leverages the same secondary structure as a `person[name:$0]?` reference.

``SELECT age FROM persons WHERE name='marko'``

The SQL query is represented in mm-ADT-bc below.

``````[db][get,persons]
[has,name,eq,marko]
[get,age]``````
 An index query is a subtype reference of a reference. An index lookup is the process of dereferencing an index query. The query result is the dereferenced instance.
###### Multi-Key Indices

A multi-key index (composite index) is modeled with multiple `[has]` instructions chained together in a reference path.

``````CREATE TABLE persons (
name varchar(255),
age int
)

CREATE INDEX persons_name_age_idx ON persons (name, age)``````

The 2-key multi-index is defined in mm-ADT-bc below.

``````[db]
[define,person,[name:@string, age:@int]]
[define,people,person*
-> [has,name,eq,@string~]   => (person[name:$0]* -> [has,age,@pred~,@int~] => person[name:$_0,age:$0($1)]*)
-> [has,age,@pred~,@int~]   => (person[age:$0($1)]*
[put,lang,$0]]] [define,person -> [put,lang,@string] => [error] // only the canonical source can be updated -> [drop,lang] => [error]] [db][get,projects] [has,title,eq,ripple] [put,lang,rdf]`````` ##### Aggregations Common database barrier functions such as counting, summing, averaging, grouping, ordering, etc. are more often than not, best accomplished by the storage system (especially when the storage system is not under significant load from concurrent queries). The storage system can offer computational services for operations it can execute more efficiently than the processor via references. A collection of typical aggregations are provided below in mm-ADT bytecode and SQL. The following diagram and mm-ADT-bc source specify a reference graph for offloading barrier operations to the storage system. mm-ADT bytecode SQL query `[db][get,persons][count]` `SELECT COUNT(*) FROM persons` `[db][get,persons][get,age][sum]` `SELECT SUM(age) FROM persons` `[db][get,persons][group,[get,age],[get,name],[fold]]` `SELECT * FROM persons GROUP BY age, name` `[db][get,persons][order,[lt,[get,age]]]` `SELECT * FROM persons ORDER BY age DESC` `[db][get,persons][get,age][dedup]` `SELECT DISTINCT age FROM persons` ``````[db] [define,people,person* -> [count] => long -> [get,age] => (int* -> [sum] => int -> [dedup] => int*) -> [group,[[get,age]],[[get,name]]] => [(@int:[@string*])*] -> [order,[lt|gt,[get,age]]] => @person*] [put,persons,people => [[db][get,persons]]]`````` The above `people` definition does not have to be defined by the user. It is possible for the storage system to provide an anonymous type (at runtime) with various exposed data access paths not specified in the model-ADT embedding. Compile-time bytecode rewriting and runtime processor redirections leverage the same reference graph structure and the evolution of this structure does not have to be a priori fully defined. In this way, the storage system can play an active role in runtime query optimization. ##### Proxies A reference object is a proxy to a referent instance. A reference can be manipulated as if it were the referent — providing data when read from and storing data when written to. What makes a reference different from its referent is that it can contain only that data necessary for it to access its referent via the type’s data access path. For large objects such as documents in a document database, references can be used to reduce object creation and limit storage/processor communication. mm-ADT references are useful as proxies in the following scenarios. 1. Remote Proxy: The processor is in a different address space (perhaps on a different physical machine) than the storage system node containing the referent instance. Network communication can degrades query performance. 2. Lazy Proxy: The bytecode will not access all the data in the referent and thus, it is not necessary to load the entire object into memory. Assume a document-ADT storing JSON documents with instances similar to the document below and matched by the following mm-ADT `person` definition. ``````{ "name" : "marko", "age" : 29, "alive": true, "address": { "street" : "el camino", "city" : "santa fe", "state" : "nm"}, "phones": { "home" : "212 555-1234", "office" : "646 555-4567", "mobile" : "123 456-7890"}, "spouse": null }`````` ``````[db] [define,person,[name:@string, age:@int, alive:@bool, address:[street:@string, city:@string, state:@string], phones:[(@string:@string)*], spouse:@person?] -> [has,name,eq,@string~] => person[name:$0]?]
[put,persons,person* => [[db][get,persons]]]``````

The `person` definition above states that the storage system provides direct access to person documents when search by name. The following bytecode retrieves a single `person` document and then walks the document tree to ultimately yield the expected result of "212 555-1234". While a storage system index is being used, object materialization and network communication can still be further reduced. The unfortunate aspect of the query is that the entire phones record is fetched when the `person` is dereferenced.

``````[db][get,persons]        // people reference
[has,name,eq,marko]  // person reference
[get,phones]         // phones instance
[get,home]           // string instance``````

The mm-ADT `[ref]` machine instruction generates a new reference from another reference or instance. The provided anonymous super-type (less constrained) is used to populate the new reference with only that data that will be needed for the remainder of the query and thus, reducing the memory footprint and network bandwidth to the absolute required minimum.

``````[db][get,persons]                            // people reference
[has,name,eq,marko]                      // person reference
[ref,person[name:@string,
phones:[(home:@string)?]]]   // person reference
[get,phones]                             // phones reference
[get,home]                               // string instance``````

The equivalent query using a MongoDB query document is provided below.

``db.persons.find({name:marko,phones:{home:{$exists}}},{phones.home:1})`` ##### Co-Locations #### Compile- vs. Run-Time For a query language to be mm-ADT compliant, there must exist a compiler that transforms queries written in the query language’s syntax to mm-ADT bytecode. Given that mm-ADT bytecode is always written with respects to the model-ADT’s primary structure, the compiler is not required to understand, nor factor-in, the storage system’s secondary structures. mm-ADT bytecode is optimized using two primary techniques. 1. Algebraic Equivalences: mm-ADT bytecode is a structural representation of a stream-computing process graph. Stream ring theory’s algebra can be used for compile-time alterations of the process graph. 2. Reference Redirections: mm-ADT bytecode is written for the model-ADT’s primary structure. References enable the runtime redirection of the processor in order to exploit more efficient data access paths through the structure graph. The reference graph and `[cost]` instruction can be used for both compile- and run-time bytecode optimization. ##### Compile-Time Optimization ##### Runtime Optimization ## model-ADT Structures and Processes A database model is a type of data model that determines the logical structure of a database and fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relational model, which uses a table-based format. — Wikipedia Article on Database Model mm-ADT can be used to model popular database ADTs. However, because mm-ADT’s objects and instructions have pre-defined semantics, it is typical for most model-ADT’s to not be in one-to-one correspondence with (a subset of) mm-ADT. In such situations, the model-ADT needs to be embedded. Fortunately, these embeddings are not as complex as model-to-model embeddings (e.g. property graph to relational) as mm-ADT provides a flexible data model and universal instruction set. This section presents a collection of popular database model-ADT’s, demonstrates their mm-ADT embedding, and outlines secondary structures used to optimize common query patterns. ### Key/Value Store model-ADT A key/value store’s primary data structure is a hash table. This ADT supports three basic operations: `get`, `put`, and `drop`. A value is put into the store via its associated key. A value is retrieved from the store via its associated key. A value is removed from the store via its associated key. All keys must be unique. The key/value-ADT’s primary structure can be physically distributed with relative ease as each key/value pair is an atomic datum. Distributed key/value stores are based on distributed hash table principles. The classic key/value model-ADT will be presented first and then contemporary extensions will be explored. #### Primary Structure A classic key/value model-ADT ``````[model,kv=>mm, [define,key,@number|@string|@bytes], (1) [define,value,@bool|@number|@string|@bytes|@sequence], (2) [define,keyvalue,[@key~,@value] => [[db][has,0,eq,$0]]                      (3)
-> [put,0,@key] => [error]                                                (4)
-> [drop,0|1]   => [error]],                                              (5)
[define,db,keyvalue*                                                        (6)
-> [put,keyvalue[@key~,@value~]] => [coalesce,[[has,0,eq,$0][put,1,$1]],
[put,[$0,$1]]]              (7)
-> [dedup,[get,0]]               => []]]                                  (8)``````
 1 The supported key types. 2 The supported value types. 3 A keyvalue is a `list` containing a key and a value. 4 A keyvalue’s key is immutable. 5 A keyvalue must always contain a key and a value. 6 A key/value store is a stream of keyvalue objects. 7 If a keyvalue with provided key exists, update its value, else insert provided keyvalue (update or create). 8 All keyvalue keys are unique.

#### Secondary Structures

The simplicity of the key/value data structure and the limited expressivity of its operations means that there are few secondary structures that can be leveraged in a meaningful way. The most important secondary structure is a hash-based index on key. Other indices that support other predicates beside equality may be useful. If the key/value pairs are sorted, then range queries based on key are more efficient to execute. Finally, most key/value stores support the parallel processing of all key/value pairs using the MapReduce pattern. The efficient implementation of this reference path requires a parallel, distributed processor whose workers are able to directly access data local partitions of the full key/value pair stream.

reference path structure description

`[has,0,eq,@key]`

index

key/values indexed by key equality

`[has,0,@pred,@key]`

index

key/values indexed by key predicate

`[put,@keyvalue]`

function

storage system performs "update or create" locally

`[count]`

aggregation

key/value count

`[order,[gt|lt],[get,0]]`

sort order

key/values sorted by key

`[has,0,between,[@key{2}]]`

sort order/index

key/values sorted by key and accessible by range

`[group,@bc{3}]`

processor

key/value processing via MapReduce

There is no standard query language for key/value-stores. They are typically interacted with via a language-specific API. An example mm-ADT-bc source is provided below that demonstrates the use of the various defined instructions.

``````[db][count]                            // 0
[db][put,[a,[name:marko,age:29]]]
[put,[c,[name:josh,age:32]]]
[put,[d,[name:peter,age:10]]]
[db][put,[d,[name:peter,age:35]]]      // updated d value
[db][count]                            // 4
[db][has,0,eq,a]
[get,1]                            // [name:marko,age:29]
[db][group,[[get,1]
[get,age]
[ifelse,[is,gt,30],
[obj,old],
[obj,young]]],
[obj,1],
[sum]]                     // [old:2,young:2]``````

A property graph is network data structure composed of a set of vertices connected by a set of edges. In a directed binary graph, edges connect two vertices: a tail vertex and a head vertex. If the graph is multi-relational, then edges are typed. If the graph is attributed, then vertices and edges have key/value properties. Different graph database providers have slightly different property graph models. The model popularized by Apache TinkerPop3 will be presented first and then common variations will be discussed.

#### Primary Structure

``````[model,pg=>mm,
[define,key,string<neq(id)&neq(label)>],                                             (1)
[define,value,@bool|@number|@string],                                                (2)
[define,property,@key:@value]                                                        (3)
[define,element,[id:@object,label:@string,@property*]                                (4)
-> [drop,(id|label)]              => [error]                                       (5)
-> [put,(id|label),@object]       => [error]                                       (6)
-> [is,eq,@element[id:@object~]]  => [has,id,eq,$0]], (7) [define,vertex,element[id:@object~,inE:@edge*,outE:@edge*] (8) => [[db][has,id,eq,$0]]                                              (9)
-> [get,outE] => (edge[outV:$_]* -> [put,@edge~] => [branch,[put,$0],
[[get,inV][get,inE]
[put,![ref,$0]]]]), (10) -> [drop] => [[branch,[_], [get,inE], [get,outE]] [drop]]], (11) [define,edge,element[id:@object~,outV:@vertex,inV:@vertex] (12) => [[get,outV][get,outE][has,id,eq,$0]]                (13)
-> [put,(outV|inV),@vertex] => [error]                                             (14)
-> [drop,(outV|inV)]        => [error]                                             (15)
-> [drop]                   => [[branch,[_],
-> [is,eq,@statement]  => [[has,s,eq,$_0][has,p,eq,$_1][has,o,eq,$2]] (5) -> [put,s|p|o,@object] => [error] (6) -> [drop,s|p|o] => [error]] (7) [define,db,statement* (8) -> [dedup] => [] (9) -> [put,@statement~] => [[coalesce,[[db][is,eq,$0]],[[put,$0]]][db]]]] (10)``````  1 A uri is a string that matches the provided regular expression pattern. 2 A literal is a primitive object 3 A blank node is a locally resolvable URI 4 A statement is record containing a subject, predicate, and object. 5 Two statements are equal if they have the same subject, predicate, and object. 6 A statement is immutable. 7 A statement is immutable. 8 An RDF-store is a stream of statements. 9 An RDF-store does not contain duplicate statements. 10 If the statement already exists, do nothing, else insert statement. #### Secondary Structures reference path structure description `[has,s,eq,@uri|@bnode]` index statements indexed by subject `[has,p,eq,@uri]` index statements indexed by predicate `[has,o,eq,@uri|@bnode|@literal]` index statements indexed by object ```[has,s,eq,@uri|@bnode] [has,p,eq,@uri]``` index statements indexed by a composite of subject/predicate ```[has,o,eq,@uri|@bnode|@literal] [has,p,eq,@uri]``` index statements indexed by a composite of object/predicate ```[has,s,eq,@uri|@bnode] [has,p,eq,@uri] [at,[db], [[has,s,eq,![get,o]] [has,p,eq,@uri]]]``` denorm length-2 statement paths pre-joined by object/subject `[count]` aggregate statement count ```[has,p,eq,@uri] [count]``` index/aggregate statement count by predicate #### Example Compilations Each table below presents an example SPARQL pattern-match and two respective compilations to mm-ADT bytecode. If a reference path is available (given the previously presented secondary structures), then the path is written on a single line. Thus, the fewer lines of bytecode, the more secondary structures are being used to solve the query. language How many knows edges exist in the graph? SPARQL `SELECT COUNT(?s) WHERE {?s ex:knows ?o}` mm-ADT-bc (primary) ```[db] [has,p,eq,ex:knows] [count]``` mm-ADT-bc (primary + secondary) `[db][has,p,eq,ex:knows][count]` language What is the average age of the people that marko knows? SPARQL ```SELECT AVG(?b) WHERE { ex:marko ex:knows ?a ?a ex:age ?b .}``` mm-ADT-bc (primary) ```[db] [has,s,eq,ex:marko] [has,p,eq,ex:knows] [at,[db], [[has,s,eq,![get,o]] [has,p,eq,ex:age]]] [get,o] [mean]``` mm-ADT-bc (primary + secondary) ```[db][has,s,eq,ex:marko][has,p,eq,ex:knows][at,[db],[[has,s,eq,![get,o]][has,p,eq,ex:age]]] [get,o] [mean]``` language What project did marko’s colleague create with him who also created ripple? SPARQL ```SELECT ?a ?b WHERE { ex:marko ex:knows ?a ex:marko ex:created ?b ?a ex:created ?b ?a ex:created ex:ripple . }``` mm-ADT-bc (primary) ```[db] [match, [[at,[db], [has,s,eq,ex:marko]] [has,p,eq,ex:knows] [get,o] [as,a]], [[at,[db], [has,s,eq,ex:marko]] [has,p,eq,ex:created] [get,o] [as,b]], [[path,a] [at,[db], [has,s,eq,![_]]] [has,p,eq,ex:created] [get,o] [as,b]], [[path,a] [at,[db], [has,s,eq,![_]]] [has,p,eq,ex:created] [has,o,eq,ex:ripple]]]``` mm-ADT-bc (primary + secondary) ```[db] [match, [[at,[db],[has,s,eq,ex:marko]][has,p,eq,ex:knows] [get,o] [as,a]], [[at,[db],[has,s,eq,ex:marko]][has,p,eq,ex:created] [get,o] [as,b]], [[path,a] [at,[db],[has,s,eq,![_]]][has,p,eq,ex:created] [get,o] [as,b]], [[path,a] [at,[db],[has,s,eq,![_]]][has,p,eq,ex:created][has,o,eq,ex:ripple]]]``` ### Document model-ADT A document database is composed of a set of named collection containing documents. Popular document structures include XML, JSON, and YAML. Document are hierarchical (tree-like) and isolated from each other. That is, documents do not refer to each other. Document isolation facilitates distribution as each node in the cluster is responsible for a range of documents (hashed by a shared field such `_id`). A collection is searched using document pattern-matching. A proto-document containing predicates is submitted as a query and all documents that match the query document are accessible via a result iterator. This section will present a JSON-based document model-ADT. #### Primary Structure A Document model-ADT ``````[model,doc, [define,document,dobject[_id:@string]], [define,dobject,[(@string:@value)*]], [define,dlist,[@value*]], [define,dvalue,@string|@number|@bool|@dobject|@dlist], [define,db,[(@string:@document*)*] -> [dedup,[get,_id]] => []]]`````` #### Secondary Structures reference path structure description `[has,_id,eq,@string]` index documents indexed by _id value ```[get,@string] [has,@string,eq,@dvalue]+``` index collection documents indexed by composite of first-level fields ```[get,@string] ([get,@string] [is,@pred,@dvalue]*)+``` index collection documents indexed by nested pattern ```[get,@string] [is,typeof,@type]``` index collection documents indexed by pattern-matching ```[get,@string] [count]``` aggregate collection document count `[drop,@string]` function delete a collection of documents ```[get,@string] [dedup,[get,@string]+]``` function collection documents dedup’d by field `[group,@bc{3}]` processing document processing via MapReduce ``````db.persons.insertOne({name:marko, age:29, projects:[{name:lop,lang:java}], knows:[{name:josh},{name:vadas}]}) db.persons.find({"knows.name":{$regex:/^j.*/}},{name:1,age:1})``````
``````[db][get,persons]
[put,[name:marko,
age:29,
projects:[[name:lop,lang:java]],
[db][get,persons]
[filter,[[get,knows]
[get,name]
[is,regex,'^j.*']]]
[ref,[name:@string,age:@int]]``````

## Glossary

A data structure specification containing data types, instructions, and their operational semantics. An instance of an ADT is a data structure operated on with instructions from the ADT instruction set.

ADT-1 can be losslessly encoded in ADT-2 if ADT-2’s data types and instructions form a mathematically equivalent superset of ADT-1’s specification. An embedding requires the existence of a homomorphism from the data types and instructions of ADT-1 to the data types and instructions of ADT-2.

Anonymous type

A type that has no defined name symbol.

Bytecode

A linear, nested collection of instructions from the mm-ADT instruction set. Bytecode represents the functional composition of instructions into a procedural program used to read/write mm-ADT data structures.

Denormalization

The duplication of data within the primary structure as a means of memoizing frequented data access paths.

Dereference

The conversion of a reference to its referent.

Embedding

An ADT modeled using the data types and operations of another ADT.

Function reference

A reference containing instruction objects that more optimally executed by the storage system. Examples include count, max, group, dedup, etc. instructions.

Index

An secondary structure that provide sublinear time access to objects in the primary structure. In mm-ADT, an index is any reference with a filter-instruction to a resultant reference containing a subset of the original reference’s referents.

Instruction

An atomic unit of processing. An instruction maps a sequence of zero or more input objects to a sequence of zero or more output objects. Instructions are composed to express complex computations.

A multi-model ADT composed of primitive and sequence data types and map, filter, reduce, and branch instructions.

Named type

A type pattern that has a defined name symbol.

Noop reference

A reference with instructions that map to `[]`.

NoSQL

A post-millennium technology movement focused on the development of scalable, distributed database technologies via non-relational ADTs that can more readily be distributed at the expensive of query expressivity.

Instance

An object that can match one and only one structure: it’s self.

List

An mm-ADT sequence data type composed of ordered objects accessible by an integer index.

Pattern

A pattern is a data structure that describes a data structure. A pattern may be general enough that is describes many structures (type) or it may be specific enough that it describes only one structure (instance).

Pointer reference

A reference used to combine data into complex data structures.

Processing engine

A software program that can generally execute functions on data. An mm-ADT compliant processing engine can generate an execution plan from submitted bytecode. A processing engine manipulates an mm-ADT storage system’s data structure as a means of computing on the embedded model-ADT.

Proxy reference

A sparsely populated reference to a remote referent.

Query language

A human-oriented computer language for controlling a processing engine. While the term "query" implies read-only, in practice, it also includes mutations. An mm-ADT compliant query language maintains a compiler from the query language to mm-ADT bytecode.

Bytecode strategy

A transpiler for mapping a bytecode’s instruction-tree to another instruction-tree. Optimization-oriented bytecode strategies leverage data statistics and expression equivalences to generate an instruction-tree that is more efficient to execute yet semantically equivalent to the original bytecode instruction-tree.

Record

An mm-ADT sequence data type composed of key/value pairs called fields.

Reference

A proxy to an referent object. References maintain the type of and data access path to the referent.

Referent

An object that is referred to by a reference.

Schema

A specification of the structure of data. mm-ADT types can be used to model database schema constraints.

Storage system

A software program that physically encodes a data structure onto disk, into memory, or distributed across a multi-machine compute cluster. Beyond data storage, these systems typically manage transactions, concurrency, distribution, caching, backup, and other persistence related features.

Stream

A unbounded sequence of objects accessible only through iterative, lazy evaluation.

Substream reference

A reference to a subset of the referents of another referent. Database indices are modeled as substream references.

Subtype

Type X is a subtype of type Y if X matches a subset of the objects matched by Y.

Type

An object that matches more than one structure.

## Conclusion

Built-In Type System
• Should there only be `integer` and `floating` as `short`,`long`, `double`, etc. can be modeled as constraints.

• What is the general nature of the primitive operators?

• Is the `[op]` instruction a good idea (type safety issues)?

• Should math be contained in a single instruction `[math,expression]`?

Instruction Set
• The instruction set is about 75% complete.

• Many instructions from TP3 are no longer needed.

• given generalized bytecode arguments.

• given all structures are lists or records.

• `[unfold]` is a "sneaky" data access method (consider it in models)

Exception Handling
• There is simply an `[error]` instruction (with no specification).

• Are there different levels of failure?

• Can errors be caught? (exception handling)

Type Derivations
• Is the & and | model of combining objects to create types and instances good?

• The model just sort of popped out without much thought.

Transaction Handling
• Thus far it is assumed bytecode is contained in a transaction.

• Does mm-ADT need to consider transactions?

• TP3 uses "traversers" and stream ring theory uses "object threads."

• How should we specify paths and sacks?

• Stream ring theory’s object metadata model is more general than TP3.

• Stream ring theory’s object metadata model is a register-based system.

• "Traversers" are realized as mini-CPUs.

• Developer an ANTLR-based compiler.

• Ensure no ambiguities in the mm-ADT-bc language.

• Work towards a syntax that provides an easy to develop parser.

• Develop a Java-based mm-ADT object API.

• Study the practical nature of references.

Cost instruction
• Develop `[cost]` as a way to determine the expense of various data paths.

• Develop the heap model-ADT as an example of a general-purpose computer.

• Develop property graph model-ADT variants (n-ary, multi-props, meta-props, etc.)

• Develop a matrix model-ADT for expressing matrix computations.

• This should really test the expressivity of the instruction set.

Discuss the Practical Stream Computing
• Akka-style message passing (multi-machine)

• Spark-style serial scans (multi-machine)

• Pipes-style chained iterators (single-machine)

• Rx-style push-based semantics (single-machine, multi-threaded)

• Most of this is already implemented in tp4/ branch.

Bytecode Optimization through Embedding
• Is it possible to represent bytecode strategies as automorphic `[model]` embeddings (bc⇒bc).

Supertype for Record and List
• struct and stream, where list and record are subtypes of struct. ?

Co-Location of Instance and Reference
• Develop a way for mm-ADT objects to migrate throughout a cluster.

• Think in terms of a single `mmADTInputFormat` and a message passing instruction subset for query routing.

The Database Object
• The `[db]` instruction and object need to be thought through more clearly.

• Does every node in the cluster have a `db` object?

• Is the `db` object provide access to both storage and processing?

• How are storage systems and processing engines connected to/configured?

Include the Coefficient Ring
• Stream Ring Theory is founded on the product of two rings: the function ring and the coefficient ring.

• Develop a syntax for coefficients in bytecode and coefficients in objects.

Develop the Stream Ring Theory Section
• Provide mm-ADT-bc rewrite rules that implement the axioms and theorems of stream ring theory.

• Demonstrate compilation via algebraic manipulation.

The Role of the Match Instruction
• Should this be implemented using more primitive instructions?

• The runtime cost should be handled by a reference.

• Can an ingenious reference structure yield match semantics without the actual instruction?

Specify Casting
• How do you change the type of an object?

• We currently just `[ref]` it, but this is only useful for streams.

• How do you redefine a type? `[define]`?

Namespaces
• As we include more objects and models, it will be necessary to develop a namespacing system.

Join Algorithms
• The `[at]` instruction is providing nested loop join semantics.

• Need better examples demonstrating sort-merge join and the role of references in the decision process.

• A rough example of a hash join implementation was demonstrated using instruction rewrites.

• This was interesting in that the reference did a complex computation and stored results.

• References as complex computational entities?

• If we can add some form of memoization (aggregate/barrier), we can do describe/execute more efficient joins.