Normally, IT people, including architects, engineers, scientists and developers, are forced to think about a particular implementation of an application or a business solution in terms of the query language. More often than not, it is the specific technology and infrastructure behind the scenes that dictates how things should be done. Recently the first workshop on web standardization for graph data was held in Berlin, Germany. The title of the workshop, “Creating Bridges: RDF, Property Graph and SQL”, indicates that it is technology again, and its vendors, that force a narrow view of how to apply graph networks in data management. I hope they do not seek a solution that temporarily makes everyone in business happy, instead of taking a firm decision to move ahead on a new path that will prove to be the right choice over time. That said, I do recognize that there are always compatibility issues with older technology in place, but the bridges should serve the purpose of a smooth transition from the old age to the new one.
The new brave age of hyperlinked data
But there is another factor more important than technology: the need to provide business insight, understanding and/or solutions to real-world problems of increasing complexity. After all, many recognize that this was the driving force behind the NoSQL movement that started ten years ago. Out of the three main technological categories of database products, key-value, document-hierarchical and graph networks, the last one has emerged with significant importance for the future of the IT business, for many reasons that will not be analysed here. Suffice it to say that the WWW, the most advanced and popular information system, is the biggest graph network of documents and other resources in the history of mankind. Today we have reached the point where the maturity of the NoSQL movement meets the evolution of the WWW, but the question remains of how exactly these two will be married to serve human needs. A quick and short answer is: by hyperlinking data in the context of information.
Technical questions for hyperlinking data
Therefore the aim has been set, but the competing database, semantic and web technologies out there make it too difficult for many to rise above the specifics. That is where a multi-layer perspective is needed to embrace the problem of hyperlinking data.
Does this mean users are going to abandon their favorite SQL row-based, column-based and document-based databases, or their existing triple stores and property graph databases? No, not at all, but this new technology should integrate them into the larger picture, which takes into account the following questions concerning the hyperlinked graph data network:
Although many vendors consider a native graph network storage layer as an advancement in graph databases, this approach has the following drawbacks:
- it results in poor data locality, an issue you may consider independently of horizontal scaling.
- you are forced to apply a graph layout on data too early, especially on record-based data.
- it is not easy to implement quick data transformations, i.e. to return the result of a data query in a tabular, columnar, hierarchical or graph network format. All four are needed for different purposes.
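To make the last point concrete, here is a minimal Python sketch of the same small result set delivered in all four shapes. The sample records, field names and function names are hypothetical, chosen only for illustration; the sketch says nothing about how a storage engine should implement these transformations.

```python
from collections import defaultdict

# Hypothetical query result; the field names are illustrative only.
rows = [
    {"id": 1, "name": "Alice", "city": "Berlin"},
    {"id": 2, "name": "Bob", "city": "Athens"},
    {"id": 3, "name": "Carol", "city": "Berlin"},
]

def to_tabular(records):
    """Row-oriented view: a header plus one tuple per record."""
    header = list(records[0].keys())
    return header, [tuple(r[h] for h in header) for r in records]

def to_columnar(records):
    """Column-oriented view: one list of values per field."""
    return {name: [r[name] for r in records] for name in records[0]}

def to_hierarchical(records, group_by):
    """Tree view: records nested under the value of one field."""
    tree = defaultdict(list)
    for r in records:
        tree[r[group_by]].append({k: v for k, v in r.items() if k != group_by})
    return dict(tree)

def to_graph(records, key):
    """Graph view: records and distinct field values become nodes,
    and an edge links each record to each of its values."""
    nodes, edges = set(), []
    for r in records:
        record_node = ("record", r[key])
        nodes.add(record_node)
        for name, value in r.items():
            if name == key:
                continue
            value_node = (name, value)
            nodes.add(value_node)
            edges.append((record_node, value_node))
    return nodes, edges
```

Note that in the graph view the two Berlin records share a single `("city", "Berlin")` node, which is exactly the kind of layout decision that a native graph store forces you to take up front.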
So what is the best type of storage engine to serve these specifications? Would it be better to think in terms of a hybrid architecture?
This is where the basic constructs of the technology must be thoroughly specified. Since we are talking about a graph network:
What types of nodes and edges do we have?
Do you allow key-value properties in nodes and edges?
Does an edge connect only two nodes?
What type of cardinality and linkage, i.e. unidirectional or bidirectional, do they have?
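One way to see how much these questions shape the design is to write the answers down as types. The following Python sketch is only one possible set of answers, not a specification: the names `Node`, `Edge` and `Direction` are hypothetical, and the assumption that a unidirectional edge is traversed in endpoint order is mine.

```python
from dataclasses import dataclass, field
from enum import Enum

class Direction(Enum):
    UNIDIRECTIONAL = "unidirectional"
    BIDIRECTIONAL = "bidirectional"

@dataclass
class Node:
    node_id: int
    node_type: str
    # Key-value properties are allowed on nodes in this sketch.
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    # Tuple of node ids; more than two endpoints makes a hyperedge.
    endpoints: tuple
    direction: Direction = Direction.UNIDIRECTIONAL
    # Key-value properties are allowed on edges as well.
    properties: dict = field(default_factory=dict)

    def __post_init__(self):
        if len(self.endpoints) < 2:
            raise ValueError("an edge needs at least two endpoints")

    @property
    def is_binary(self):
        """True when the edge connects exactly two nodes."""
        return len(self.endpoints) == 2

    def connects(self, source, target):
        """True if the edge can be traversed from source to target."""
        if source == target:
            return False
        if source not in self.endpoints or target not in self.endpoints:
            return False
        if self.direction is Direction.BIDIRECTIONAL:
            return True
        # Assumption: a unidirectional edge follows the endpoint order.
        return self.endpoints.index(source) < self.endpoints.index(target)
```

Answering "no" to any of the questions simply removes a field or a branch, which is why these choices need to be pinned down before anything else.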
Speaking about linkage, both RDF and property graph databases are anchored on appropriate semantic labelling of edges, and the storage and retrieval of information at the implementation level then depend on this labelling.
However, have you considered that this extra piece of information, i.e. the label, can in most cases be a metadata property of the node, e.g. type, category, role, characteristic, etc.?
For that reason labelling of edges should be left to the data modeller or business person, purely for conceptualising the business model. Here is perhaps the most critical question.
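As a rough illustration of this idea, the Python sketch below keeps edges completely unlabelled and answers a typical "labelled" question from node metadata alone. The sample nodes, the `type` property and the function name are hypothetical.

```python
# Hypothetical nodes: the role lives on the node as metadata.
nodes = {
    1: {"name": "Alice", "type": "Person"},
    2: {"name": "Acme", "type": "Company"},
    3: {"name": "Bob", "type": "Person"},
}

# Edges carry no label, only the pair of node ids they connect.
edges = [(1, 2), (3, 2)]

def neighbours_of_type(node_id, wanted_type):
    """Return the neighbours of node_id whose 'type' metadata matches."""
    result = []
    for a, b in edges:
        if node_id in (a, b):
            other = b if a == node_id else a
            if nodes[other]["type"] == wanted_type:
                result.append(other)
    return result
```

Here `neighbours_of_type(2, "Person")` plays the part that a query over an edge label such as "employs" would play in a labelled graph, without the storage layer ever seeing that label.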
How do you use these constructs to build business data models and what are their semantics?
This leads us inevitably to the conceptual level.
I think RDF graphs, labeled property graphs, Topic Map graphs (hardly anyone talks about this standard anymore) and even entity-relationship graphs in relational theory have varying types of constructs with big differences in meaning and use. Entities, attributes, properties, data resources, relationships, relations, associations, items, instances and many other concepts make up the IT Babel at the business, end-user level.
Here lies a major source of confusion for the newcomer (I used to be one of them): the distinction between classes and instances, metadata and data, terminological and assertional components. This distinction, like many others in the IT semantic world, is artificial; nevertheless it plays an extremely important role in programming, in data retrieval and even in inferencing mechanisms.
If you think you can handle this distinction easily, I urge you to think again, because in my opinion these two concepts must be clearly separable and at the same time they have to be related, like the hand mold and the letter types in typography. However, I do not see this happening in our modern graph technologies. Time does not permit me to expand on this argument, but it is entangled with the following important observation.
By reference vs. by value
Those of us familiar with programming and the internals of computing know that there is another critical, fundamental distinction in data processing: do you access the datum by reference or by value? This was the point at which the RDF/OWL framework showed the first signs of derailment. First they came up with the idea of the uniform resource locator (URL), then the concept was elevated to the uniform resource identifier (URI), and the result was the web identity crisis, for those old enough to remember the clash with the proponents of the Topic Maps data modeling framework. If you ask me, that crisis has not been resolved yet.
The main reason behind the web identity crisis is the use of namespace addressing to connect, identify and locate data. The notion of a web address has been semantically overloaded because of its textual form. In contrast, internet protocol addressing and, usually, the primary key mechanism of databases are numerical. Perhaps they thought that this way it would be easier for developers and machines to do the linkage in the same way they use HTML anchoring. But I guess this is more or less a fallacy of TimBL’s linked data network, which was designed primarily for documents (HTML pages), not record-based (relational) data. The origin of this misconception is Memex, and it continues in our days with the invention of many different encodings, including RDFa, Microdata and JSON-LD, intended for generating linked data manually or semi-automatically in semi-structured, hierarchical forms and HTML documents.
The Giant Global Graph (GGG), or WWW 3.0, will become a reality only if the W3C admits that it has to build an IP-like numerical addressing scheme for concepts in data models and for hyperlinked data. It is this new protocol that will permit networking at the semantic level. It is going to be the same protocol that we use for identification purposes, the same protocol that will make each data element traceable and connectable, and the same protocol that will be used in number-crunching computers. Moreover, the whole process of constructing hyperlinked data, especially from structured data resources (databases), will be fully automated and hidden in the same way TCP/IP is hidden from internet users.
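To give a feeling for what numerical addressing of data elements could look like, here is a purely hypothetical Python sketch. The four-part layout (model, entity, attribute, instance), the 16-bit field width and the dotted notation are all assumptions made for illustration; they are not an existing W3C proposal and are not presented as the S3DM/R3DM scheme.

```python
# Hypothetical layout: four 16-bit fields packed into one 64-bit integer.
FIELD_BITS = 16
FIELD_MAX = (1 << FIELD_BITS) - 1
FIELDS = ("model", "entity", "attribute", "instance")

def pack_address(model, entity, attribute, instance):
    """Pack the four numeric parts into a single integer address."""
    address = 0
    for part in (model, entity, attribute, instance):
        if not 0 <= part <= FIELD_MAX:
            raise ValueError(f"part {part} does not fit in {FIELD_BITS} bits")
        address = (address << FIELD_BITS) | part
    return address

def unpack_address(address):
    """Split an integer address back into its named parts."""
    parts = []
    for _ in FIELDS:
        parts.append(address & FIELD_MAX)
        address >>= FIELD_BITS
    # Parts were read from the low end, so reverse them to match FIELDS.
    return dict(zip(FIELDS, reversed(parts)))

def dotted(address):
    """Render an address in an IP-like dotted notation for human readers."""
    return ".".join(str(value) for value in unpack_address(address).values())
```

With such a scheme a data element is identified and located by the same number, while any textual name stays a presentation concern, much as host names sit on top of IP addresses.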
Epilogue
The reader may realize that I have only touched various parts of the “elephant in the room”. But the key points that I tried to make digestible arose from my experience with semantic and database technologies and, ultimately, from my long-standing personal effort to develop the S3DM/R3DM computational semiotics framework. This way I hope that I managed to illuminate a different path, not in theory but in practice. And if there is one thing to remember from this article that will assist you in following this path, then let it be granular computing.