The Variable-Value Pair (VVP)
Recently, on 5 November 2013, the W3C proposed JSON-LD, yet another serialization of RDF. This made me think again about the fundamental unit of data representation in computing systems and applications. According to Wikipedia, it is known as the name-value pair, key-value pair, field-value pair or attribute-value pair.
Professional IT developers will also recognise it as the variable-value pair in programming, as the (Object, Attribute)-Value pair in OOP, as the field-value pair in relational databases, as the key-value pair in NoSQL databases, as the query string in a URL, as the (S, P)-O triple in RDF, and as the attribute-value pair in JSON, the most popular data interchange format of our days. This demonstrates how powerful this “pair” is. Two fundamental observations arise from this comparison.
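To make the comparison concrete, here is a minimal Python sketch that writes one and the same pair in several of the forms listed above. The names (`salary`, `employee:42`) and the figure are illustrative assumptions, not part of any standard vocabulary.

```python
import json
from urllib.parse import urlencode, parse_qs

# One pair, hypothetical name and value chosen for illustration.
name, value = "salary", 150000

as_key_value = {name: value}                 # key-value / variable-value pair
as_query = urlencode({name: value})          # query string in a URL
as_json = json.dumps({name: value})          # attribute-value pair in JSON
as_triple = ("employee:42", name, value)     # (subject, predicate) -> object
as_row = [("employee_id", 42), (name, value)]  # field-value pairs of a table row

# Every form can be read back into the same pair.
recovered = {
    "query": int(parse_qs(as_query)[name][0]),
    "json": json.loads(as_json)[name],
    "triple": as_triple[2],
    "row": dict(as_row)[name],
}
```

In each case the syntax changes, but what is stored is still a name bound to a value.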
First, there is an emergent need to describe any kind of relation between two things in a natural, human-readable way. Second, the pair is powerful enough to serve as the cornerstone on which everything else is built. Everything has to be expressed as a VVP in a semantically layered approach, similar to what the Topic Map data model proposed, i.e. distinguishing associations between concepts from the occurrences that represent them. In my opinion, this is where the semantic web stack fails: it has mixed syntax with the retrieval of information and with its representation.
What I propose in the R3DM data model is a clear separation of Information Realization (IREA), i.e. literals and data types, which I call DataIR or DIR (Data Information Resources), from Information Representation (IREP), i.e. image, web page, quantity, number, etc., which I call BinIR or BIR (Binary Information Representation).
Collectively I call all of them Information Resources (IRES), or Term Information Resources (TermIR or TIR) for simplicity. This definition includes concepts from our daily life such as humans, cars, trees and sky.
To avoid confusion over the definition of all these terms, and for identification purposes, Information Reference (IREF), i.e. authority, user, host, domain, etc., has to be added as an extra dimension in the representation of information. I also call this term RefIR or RIR (Reference Information Representation).
As a reminder, one can easily arrive at the following memorable formula:
Information Resource = Information Reference + Information Representation + Information Realization
Or expressed in terms of things
And in an even simpler form as
In fact, U1R3DM (Unified Information Resources with Reference, Representation, Realization Data Model) proposes a new information theory based on the hypothesis that any information resource can be computed and represented as the unity of three components, which happen to occur naturally in the description of Aristotle’s semiotic triangle. This is exactly what is illustrated in this post.
VVP and the identity crisis on the web
The more I research the subject of data modeling, the more I understand the confusion that arises from the use of the variable-value pair, which I call VVP for short. Many of you will recognize it as the fundamental unit of construction in computer hardware and software architecture, including databases, i.e. the storage and retrieval mechanism. Nevertheless, many fail to grasp the semantics behind the simple assignment statement of the VVP, i.e. VAR = VAL. Pointer logic in computer programming is very helpful here. Or think about a post box (container) with a street address and an envelope (content) inside. It is also important to think about the type of container (post box) and the type of content (envelope).
What is the analogy with R3DM?
- an identifier to (R)epresent the variable name, a.k.a. the container
- a literal to (R)ealize (materialize) the value, a.k.a. the content
- a memory address to (R)eference / de-reference its content
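The three roles can be seen side by side in a small sketch. It uses Python's standard `ctypes` module only because it exposes raw addresses; the variable name `answer` and the values are assumptions made for illustration, and this is not meant as an implementation of R3DM.

```python
import ctypes

# Realization: a typed container holding the literal 42 as a C integer.
box = ctypes.c_int(42)

# Representation: an identifier bound to that container.
names = {"answer": box}

# Reference: the memory address where the content lives.
address = ctypes.addressof(box)

# Dereference: read the content back knowing only the address and the type.
deref = ctypes.c_int.from_address(address).value

# Writing through a pointer changes what the identifier sees.
ptr = ctypes.pointer(box)
ptr.contents.value = 43
after_write = names["answer"].value
```

Note that the address alone is not enough to read the value: the type of the container (`c_int`) is needed as well, which is the post box and envelope point made above.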
You can see a variable from three perspectives: as an identifier, as an address, and as a container with a (data) type. In the current semantic web of linked data, the URI is overloaded with many roles, i.e. dereferencing, identification, serialization (Relational, Topic, Graph, JSON-XML). Is there a way to combine all these different perspectives in R3DM? I think perhaps there is one.
Conceptual Model vs Data Model
The RDF triple, (Subject, Predicate, Object) or (Entity, Attribute, Value), is strictly speaking neither a conceptual (information) model nor a data model.
By definition, a conceptual model is about concepts. It is about how I can deliver a message I have in my head to someone else. Normally we use natural language (words, sentences, speech) as the vehicle to convey that message. But that is not the only medium we use: think about art, figures, pictures, photos, diagrams, tactile feedback, music, video, and many others that are not based on words or natural language.
So, let us assume for a moment that we focus only on textual representations. Is SPO adequate as a construction unit to model observations, data sets, or relations? If yes, is there a standard way to structure concepts and relate instances? Think about the components of SPO. Predicative units can become attributive and vice versa, an object can take the position of a subject and the other way around, and relations can be inverted.
But perhaps the most annoying of all is the abuse of that conceptual model by using it at the same time as a data model, that is, to link values such as numbers, strings, dates and binary objects. And of course everybody throws up their hands when you ask what exactly that value or that object represents. Or think about the classic coding distinction in passing arguments: are you calling by value or by reference? Do you embed objects, do you build complex data types, how do you model multi-valued properties? If you think there is a standard way to answer all these questions, think again, because there is not one.
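The by-value versus by-reference question can be illustrated with a short Python sketch. Python itself passes object references (often described as "call by sharing"), so the copy has to be made explicitly. The record, its field names and the phone numbers are hypothetical.

```python
import copy

def give_raise(record):
    # Mutates the object the caller passed in: the caller sees the change.
    record["salary"] += 1000

def rebind(record):
    # Rebinds only the local name: the caller's object is untouched.
    record = {"salary": 0}
    return record

# A record with an embedded multi-valued property (a list of phones).
employee = {"name": "Ada", "salary": 150000, "phones": ["555-0100", "555-0101"]}

give_raise(employee)
rebind(employee)

snapshot = copy.deepcopy(employee)   # an independent copy, "by value"
alias = employee                     # a second name for the same object
alias["phones"].append("555-0102")   # visible through both names, not in the copy
```

The same pair `phones = [...]` therefore behaves differently depending on whether the value was shared or copied, and nothing in the pair itself says which one was intended.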
So here R3DM comes into the game to make things clearer: Reference, Representation and Realization. Think of realization as the encoding of 0s and 1s you have stored in memory. This is the only “real thing” you have stored, and it is the lowest level of representation, the one the computer understands. Then we have another level of representation, the highest one, which we are familiar with: the conceptual level. It is about the terms in which we naturally think and which we naturally construct in our brains. The third level of representation is an intermediate one. It connects the two worlds, the human one with the computer one. It is the vehicle, the medium we use to convey the message, the information, according to the semiotic triangle. We can use text, images, sound, video, etc. to achieve this, as long as we can digitize the media.
The three levels of representation are unified as one, and one is seen in three. That is the beauty of R3DM.
Data and Metadata
One of the main obstacles in data modeling is that we are trying to understand the mechanism of abstraction itself. In fact we are trying to model abstraction, to model the machine that creates models! Recall at this point what metadata is: it is data about data. So you see how easily you can fall into an endless loop.
Think about the term “name”. It is one of the most essential concepts we use to model, to reason, to refer, to identify. It is so hard to escape from naming and change everything to an alphanumeric id. Who is doing the naming, what do they mean, and in what natural language?
A name is just one of the many “signs” we use, according to “semiosis”, to carry a message, a piece of information. Think about any kind of programming paradigm, data model or information model. You draw a flowchart, a nice diagram, to describe your model, and at the same time you use symbols and words or terms to enhance it. All these are different kinds of signs. So, do we have a model of the signs and of how we use them to construct information and to store, retrieve and update our data? Not at the moment, I think.
Let’s move now to a term that has become very popular in the NoSQL movement: schemaless databases. Schemaless does not mean that there is no schema. It means that the database allows you to construct any kind of schema dynamically. In a sense we are talking about a database with a generic kind of schema that enables you to produce any other schema.
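One common way to picture a "generic schema that produces any schema" is an entity-attribute-value table. The sketch below uses Python's built-in `sqlite3` module; the table name `pairs`, the entity ids and the attributes are assumptions for illustration, and real NoSQL stores are organised differently internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The only fixed schema: one table of (entity, attribute, value) rows.
conn.execute("CREATE TABLE pairs (entity TEXT, attribute TEXT, value TEXT)")

rows = [
    ("person:1", "name", "Ada"),
    ("person:1", "salary", "150000"),
    ("car:7", "colour", "red"),       # a different "schema", same table
]
conn.executemany("INSERT INTO pairs VALUES (?, ?, ?)", rows)

def load(entity):
    """Rebuild one record from its variable-value pairs."""
    cur = conn.execute(
        "SELECT attribute, value FROM pairs WHERE entity = ?", (entity,)
    )
    return dict(cur.fetchall())

person = load("person:1")
car = load("car:7")
```

The person and the car have different attributes, yet neither required a change to the table definition. The price is that every value is stored as text here, so the data type (the realization) has to be recorded or agreed on somewhere else.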
Are you getting the picture now? Data and metadata, schemaless and schema, data model and abstract model. It is an abstract machine: think about Turing machines, finite state machines, recursive functions and lambda calculus, and you get closer to it.
A Computational Semiotics Explanation
The following Wikipedia pages are related to realization; you may open and read them:
- Data type : en.wikipedia.org/wiki/Data_type
- Data structure : en.wikipedia.org/wiki/Data_structure
- Class : en.wikipedia.org/wiki/Class_(computer_programming)
- Object : en.wikipedia.org/wiki/Object_(computer_science)
You read the content of each web page in a presentation format suitable for the user, with at least three levels of encoding behind it for machine processing purposes:
- wiki encoding : Wiki data type (interpretation of bits)
- html encoding : HTML data type (interpretation of bits)
- machine code : Series of bits
To keep things simple, encoding is about the format you give to content for various purposes: data storage, human reading, human listening, execution of machine instructions, and many more. If you started thinking about file types and serialization formats like RDF and JSON, I must say you are on the right path. We are at the data level, where everything is centred around the encoding of the content. If we apply recursion to reach the machine encoding level, we have to think about a base case. For example, you can encode an integer as a 16-bit or a 32-bit binary number.
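The 16-bit versus 32-bit base case can be shown with Python's standard `struct` module. The number 1000 and the big-endian byte order are arbitrary choices made for this sketch.

```python
import struct

n = 1000  # the integer we want to realize as bits

as_16 = struct.pack(">h", n)   # big-endian signed 16-bit: 2 bytes
as_32 = struct.pack(">i", n)   # big-endian signed 32-bit: 4 bytes

# Decoding needs the same format that was used for encoding.
back_16 = struct.unpack(">h", as_16)[0]
back_32 = struct.unpack(">i", as_32)[0]

# The same four bytes read with a different data type give a different answer.
as_two_shorts = struct.unpack(">HH", as_32)
```

The bits themselves carry no meaning until a data type is chosen to interpret them, which is why the same bytes can be read as one 32-bit integer or as two 16-bit integers.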
How do we represent an integer? In programming we use an identifier, the name of a variable. This is our entry level to the world of semantics. According to semiotics, it is about the signs we use to represent information about an entity. To continue our example, if we want to encode and use a salary figure we can write:
Salary = $150000
The left part (and the dollar sign) is the (R)epresentation (container) and the right part is the value (content), the (R)ealization of a monetary value. There are many such identifiers in computer programming: property names, class names, method names, and relations.
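As a rough sketch of that split, the statement can be taken apart in Python into its signs and its encoded value. The 4-byte big-endian encoding is an assumption made for illustration; the post does not prescribe one.

```python
statement = "Salary = $150000"

# Representation: the signs a human reads (the name and the currency symbol).
sign, literal = (part.strip() for part in statement.split("="))
currency, digits = literal[0], literal[1:]

# Realization: the value as the machine stores it.
amount = int(digits)
realization = amount.to_bytes(4, "big")   # assumed 32-bit, big-endian

# Reading the bytes back recovers the number, but not the signs.
decoded = int.from_bytes(realization, "big")
```

The bytes alone do not say that the number is a salary or that it is in dollars; that information lives only in the signs.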
Speaking about computer programming, the following Wikipedia pages are related to reference:
- Pointer : en.wikipedia.org/wiki/Pointer_(computer_programming)
- Reference : en.wikipedia.org/wiki/Reference_(computer_science)
- Key : en.wikipedia.org/wiki/Unique_key
- Identifier : en.wikipedia.org/wiki/Identifier
The concept of reference must not be confused with other values (keys or identifiers). In the previous example:
- “Salary” is an identifier, a (R)epresentation of an integer value
- “Salary” is a reference, a (R)epresentation of the memory address where we encode the integer value.
To avoid that confusion, programmers defined another term, the pointer. A pointer (R)epresents a memory address. We dereference the pointer to read the value, and we use the pointer to access a memory block and store the value. The same idea is behind the dereferencing mechanism of the WWW and the URL. But the URL reference has been overloaded:
- URL denotes the source of information, e.g. Wikipedia.
- URL is used to disambiguate terms, e.g. Object in object-oriented_programming.
- URL is used to fetch the HTML content of the web page.
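The three roles listed above can be pulled apart with Python's standard `urllib.parse` module. The sketch uses one of the Wikipedia addresses from earlier in the post, with an `https://` scheme added as an assumption; no network request is made.

```python
from urllib.parse import urlsplit, unquote

url = "https://en.wikipedia.org/wiki/Object_(computer_science)"
parts = urlsplit(url)

# Role 1: the source of the information.
source = parts.netloc

# Role 2: the term, plus the qualifier that disambiguates it.
page = unquote(parts.path.rsplit("/", 1)[-1])
term, _, qualifier = page.partition("_(")
qualifier = qualifier.rstrip(")").replace("_", " ")

# Role 3: the locator one would dereference to fetch the HTML.
locator = parts.geturl()
```

A single string is doing all three jobs at once, which is the overloading described above.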
The way we view information resources in our model is mainly through the use of terms (Term Information Resource - TermIR). If you take any of the previous web addresses, you have the term, the realization of the term (DatumIR), the representation of the term (SignIR) and a reference (RefIR) to the content. These are three concepts in one, and one concept in three.
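One possible way to picture "three in one" in code is a single record with three fields. This is only a sketch: the class layout, the field names and the `hr.example.org` reference are hypothetical and are not taken from an R3DM specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TermIR:
    """A term seen as one unit with three facets (hypothetical layout)."""
    ref: str      # RefIR: who or where the term comes from
    sign: str     # SignIR: the sign a human reads
    datum: bytes  # DatumIR: the encoded value the machine stores

salary = TermIR(
    ref="hr.example.org/employee/42",          # made-up authority and path
    sign="Salary",
    datum=(150000).to_bytes(4, "big"),         # assumed 32-bit encoding
)

value = int.from_bytes(salary.datum, "big")
```

Keeping the three facets as separate fields means none of them has to be overloaded with the job of another.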