50 Years of SQL, can you imagine a truly NoSQL approach?

Discussion about the development of a NoSQL API

Introduction

In a decade of investigating NoSQL systems, I have noticed a huge effort from many vendors to create SQL-compatible APIs. There is a pretty good reason behind this tendency to walk the line as usual: everybody has learned to program in SQL, and everybody knows how to access a DBMS using SQL.

Well, I would guess that after 50 years of SQL-based DBMS there has to be a better approach to win programmers' acceptance, but it seems nobody has managed to achieve such an ambitious goal at scale. Nobody dares to propose something different to be adopted as a better solution, and even the few that tried, the Apache TinkerPop-Gremlin language, GraphQL and Cypher come to mind, those brave hearts have not managed to convince the majority of IT users to switch gears to their proposed query environment. Others were more radical in their approach and even imagined a connected semantic web, but when they designed their query language, SPARQL, it looked very similar to SQL, and even its acronym reminds us of SQL.

I could start listing many reasons why all these approaches have not reached consensus among the majority of IT technical users. I could also state a number of facts about why the industry and several consortia keep using SQL as their favourite DBMS query language. But in this article I would like you to focus on a very specific point, and with that perhaps I can share a bit of my imagination on how it could be possible to access a DBMS differently: with more flexibility than SQL, without losing the comfort of your programming language, without the hurdles of SQL and, most important, while earning the interest and respect of developers.

Two programming paradigm approaches for a NoSQL API

What is SQL? It is a domain-specific language designed specifically to manage data in a relational DBMS, although many proponents of relational theory correctly argue that these DBMS do not follow Codd's relational principles. But that is another important story to tell some other time.

In the last decade, graph databases started becoming very popular, but interestingly many important vendors stretched the SQL language to cover graph traversal and other related operations. Columnar databases, another very successful NoSQL type, also followed the same tactic regarding the query language. Therefore, although the data model and the physical layer implementation can vary a lot, SQL attempts to create an artificial unity at the database client level.

But again, SQL is domain specific; it started as data-model specific and remains a declarative query language. The main question is what happens at the API side, i.e. at the side of the programming language, and that is exactly where all the interesting, peculiar things happen. How exactly do you connect to the DBMS server? What protocol does the DBMS driver use to transfer data? How well do objects match entities and attributes in your DBMS? Add to these schema vs object model changes, nullable and partial loading of fields-attributes, persistence, state, concurrency and caching of objects. For OOP aficionados and others, what I briefly described is known as the object-relational impedance mismatch problem, the Vietnam of Computer Science. And make no mistake, it is still “Vietnam” out there.
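
To make the mismatch tangible, here is a minimal, self-contained Python sketch; the Station dataclass and the SQLite table are purely illustrative and do not represent any particular driver or ORM. The driver hands back flat, possibly NULL-padded rows, while the application wants richer, partially loaded objects, and the machinery that keeps the two in sync is where the trouble lives.

import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    number: int
    name: str
    terminal: Optional[str] = None   # nullable column; might also be lazily loaded

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE station (number INTEGER, name TEXT, terminal TEXT)")
con.execute("INSERT INTO station VALUES (31202, '14th & R St NW', NULL)")

# The driver returns a flat tuple; mapping it to an object, deciding which
# attributes to load and when, and keeping object state consistent with row
# state is exactly where ORM complexity accumulates.
row = con.execute("SELECT number, name, terminal FROM station").fetchone()
station = Station(*row)
print(station)   # Station(number=31202, name='14th & R St NW', terminal=None)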

Chainable operators and operands

OOP is perhaps the most popular programming paradigm, and yet, because of the SQL magic spell, everything had to be squeezed into a tabular form using one ORM or the other. But during all these years of my personal research and development effort I noticed one specific characteristic of these ORM APIs that is related to the data retrieval approach in use. There are three major trends here: Query-By-Example (QBE), the precursor of GraphQL; Query-By-API (QBA), the precursor of Gremlin; and Query-By-Language (QBL), where all the SQL-like query languages are categorised, GQL not being an exception. Developers' practical experience showed that certain complex queries, particularly those with joins, were more difficult to represent in QBA and QBE, and once more SQL won this battle. But it is not a surprise that proponents of graph database technology have only partially touched the object-relational impedance mismatch. It requires a deeper architectural design and a more multi-perspective attitude than only comparing how well nodes of the graph match objects of a class.
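
To make the three trends concrete, here is a toy, self-contained Python sketch; the match helper, the Pipe class and the in-memory SQLite table are hypothetical stand-ins, not any vendor's API.

import sqlite3

rows = [
    {"name": "Alice", "status": "open"},
    {"name": "Alice", "status": "closed"},
    {"name": "Bob",   "status": "open"},
]

# Query-By-Example: describe the shape of the answer with a template.
def match(template, data):
    return [r for r in data if all(r.get(k) == v for k, v in template.items())]

qbe = match({"name": "Alice", "status": "open"}, rows)

# Query-By-API: chain operators that build and run a pipeline.
class Pipe:
    def __init__(self, data):
        self.data = list(data)
    def where(self, **kw):
        self.data = [r for r in self.data
                     if all(r.get(k) == v for k, v in kw.items())]
        return self
    def to_list(self):
        return self.data

qba = Pipe(rows).where(name="Alice").where(status="open").to_list()

# Query-By-Language: hand a declarative string to an engine (SQLite here).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, status TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(r["name"], r["status"]) for r in rows])
qbl = con.execute("SELECT * FROM t WHERE name = 'Alice' AND status = 'open'").fetchall()

assert qbe == qba and len(qbl) == 1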

So, was that a fair battle between QBA and QBL? The answer is no, because of the existing dominance of SQL in DBMS and the mimicking of SQL join operations. That is where imagination enters the game: imagine that you could use chainable object operations with the fluent interface design method, a.k.a. QBA, without the mismatch between the DBMS and OOP, i.e. imagine a loose connection between the DBMS and the API; think of the API database client more like a bridge to fetch or transfer data. The method of chaining query operators is very promising indeed; it can win a second round of boxing with SQL. I will briefly explain why.
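
A minimal sketch of that "bridge" idea, assuming a hypothetical client where chainable operators only record a pipeline and nothing is sent to the DBMS until out() is called; the Query class and its method names are illustrative, not an existing API.

class Query:
    def __init__(self, source):
        self.source = source      # e.g. a dataset id, table or node set
        self.steps = []           # the recorded pipeline

    def over(self, select):
        self.steps.append(("over", select))
        return self               # returning self is what makes the interface fluent

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self

    def out(self):
        # A real client would translate the recorded steps into whatever the
        # engine understands (SQL, Gremlin, a REST call) and fetch the result.
        return {"source": self.source, "plan": self.steps}

plan = Query("datasets/221").over("nid, cname, ntype").filter("ntype = 'FLD'").out()
print(plan)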

OOP is based on the concept of objects that can exchange messages and modify their internal state. Therefore that kind of fluent interface is very natural in OOP languages; on the other hand, most complex queries can be visualised and processed as a data pipeline. My question to those who design GQL, the next international query language standard, is: why do you keep following a dead-end declarative approach? Wouldn't it be more natural and appealing to users to follow a data pipeline methodology by standardising operands and operations? There is plenty of hard evidence that this is not a mere assumption. Recall the TinkerPop-Gremlin reference above, and in Python there is Pandas, a very popular data analysis library (23,621 stars on GitHub).
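
Pandas itself illustrates that pipeline style well. The tiny frame below only mimics the bikeshare columns used later in this article; the point is the chain of operations.

import pandas as pd

trips = pd.DataFrame({
    "member_type": ["Member", "Casual", "Member", "Member"],
    "duration":    [534, 1937, 470, 727],
})

summary = (trips
           .query("duration < 1000")                        # filter
           .assign(minutes=lambda df: df.duration / 60)     # derive a column
           .groupby("member_type", as_index=False)          # aggregate
           .agg(trips=("duration", "count"),
                avg_minutes=("minutes", "mean"))
           .sort_values("trips", ascending=False))

print(summary)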

Functional operations

Speaking about programming languages, what is the other very competitive paradigm in software development? Arguably it is the functional one, and yet the functional requirements for such an API are more obscure. But such an API exists in one of the most powerful functional programming languages of the IT industry, Mathematica. See, for example, how carefully the Wolfram Language functions for Database-Like Operations on Datasets and Computation with Structured Datasets have been designed, and how perfectly these fit with many other functions of the same language.
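
I cannot reproduce the Wolfram Language here, but a rough Python analogue of that functional, single-expression style might look like the following; the query helper and its keyword arguments are purely illustrative.

from itertools import groupby
from statistics import mean

def query(data, *, where=None, group_by=None, aggregate=None, select=None):
    rows = [r for r in data if where(r)] if where else list(data)
    if group_by and aggregate:
        rows.sort(key=group_by)
        rows = [{"key": k, "value": aggregate(list(g))}
                for k, g in groupby(rows, key=group_by)]
    if select:
        rows = [select(r) for r in rows]
    return rows

trips = [
    {"member_type": "Member", "duration": 534},
    {"member_type": "Casual", "duration": 1937},
    {"member_type": "Member", "duration": 470},
]

# The whole query is one expression whose arguments are data and plain
# functions, in the spirit of GroupBy/Select operations on a structured dataset.
print(query(trips,
            where=lambda r: r["duration"] < 2000,
            group_by=lambda r: r["member_type"],
            aggregate=lambda g: mean(r["duration"] for r in g)))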

A glimpse from the future

Enough said; software engineering is not a theoretical thing, it is based on practice. I will give you an example. One of the drawbacks of SQL is that you do not have direct access to manage data dictionary information; this is handled more or less automatically by the DBMS. Gartner recently reported that data quality and data integration tools are incorporating data catalogs. This is certainly a step in the right direction: you need a query API that is capable of managing both data and metadata using a unified approach.

Consider the following code snippets in Python:

The first command demonstrates fetching metadata using OOP chainable Python methods:

(mis.get(221)
    .fields
    .over(select='nid, dim4, dim3, dim2, cname, alias, ntype, ctype, counter')
    .to_dataframe(index='dim4, dim3, dim2')
    .out())

Out[18]: 
                nid                 cname alias ntype ctype  counter
dim4 dim3 dim2                                                      
1    605  7     227              Duration    NA   FLD   CSV        0
          8     228            Start date    NA   FLD   CSV        0
          9     229              End date    NA   FLD   CSV        0
          10    230  Start station number    NA   FLD   CSV        0
          11    231         Start station    NA   FLD   CSV        0
          12    232    End station number    NA   FLD   CSV        0
          13    233           End station    NA   FLD   CSV        0
          14    234           Bike number    NA   FLD   CSV        0
          15    235           Member type    NA   FLD   CSV        0

The second command demonstrates fetching metadata using a functional approach:

mis.get(nid=221, 
        what='fields', 
        select='nid, dim4, dim3, dim2, cname, alias, ntype, ctype, counter',
        index='dim4, dim3, dim2', 
        out='dataframe')

Out[19]: 
                nid                 cname alias ntype ctype  counter
dim4 dim3 dim2                                                      
1    605  7     227              Duration    NA   FLD   CSV        0
          8     228            Start date    NA   FLD   CSV        0
          9     229              End date    NA   FLD   CSV        0
          10    230  Start station number    NA   FLD   CSV        0
          11    231         Start station    NA   FLD   CSV        0
          12    232    End station number    NA   FLD   CSV        0
          13    233           End station    NA   FLD   CSV        0
          14    234           Bike number    NA   FLD   CSV        0
          15    235           Member type    NA   FLD   CSV        0

And the third command demonstrates fetching data, again using a functional approach:

mis.get(221, what='data', 
             select='nid, dim4, dim3, dim2, cname, alias, ntype, ctype, counter', 
             out='dataframe', 
             limit=5, 
             offset=1000)

Out[20]: 
    534  2012-04-01 10:29:06  2012-04-01 10:38:00  31113  Columbia Rd & Belmont St NW  31201                                15th & P St NW  W00663  Member
0  1937  2012-04-01 10:29:07  2012-04-01 11:01:25  31202               14th & R St NW  31621              4th & D St NW / Judiciary Square  W00692  Casual
1   470  2012-04-01 10:29:17  2012-04-01 10:37:08  31104  Adams Mill & Columbia Rd NW  31200          Massachusetts Ave & Dupont Circle NW  W00020  Member
2   727  2012-04-01 10:29:30  2012-04-01 10:41:38  31103         16th & Harvard St NW  31200          Massachusetts Ave & Dupont Circle NW  W00880  Member
3  1144  2012-04-01 10:29:59  2012-04-01 10:49:03  31110     20th St & Florida Ave NW  31236        37th & O St NW / Georgetown University  W00681  Casual
4  1698  2012-04-01 10:30:09  2012-04-01 10:58:27  31107      Lamont & Mt Pleasant NW  31610  Eastern Market / 7th & North Carolina Ave SE  W01154  Member

Epilogue

I am sure it is not difficult for those few out there to imagine and even implement a better solution than SQL, provided you have the motive and the financial backing. In fact, as I mentioned in this article, that partly exists already. But it is certainly far more challenging and difficult to lose the safety of your daily job and/or income for years in order to imagine, and consequently strive to give flesh and bones to, a truly NoSQL API that can be more productive, efficient and innovative to work with than what already exists. Keep an eye out for the next release of the TRIADB project and, who knows, you might be convinced that it is truly a unique and valuable tool to use.

As the Seattle Report on Database Research puts it: "Therefore, we need to investigate systems that combine relational algebra and linear algebra in a richer query paradigm, potentially as extensions to SQL. Two 'holy grails' should continue to stay on our agenda. First, we must always explore any novel ideas to reduce the impedance mismatch between application development and writing database queries. Second, we must continue to find ways to make database systems less rigid (e.g., flexible schema evolution) without significantly sacrificing their performance." [1]

References


  1. The Seattle Report on Database Research - the report summarizes the discussion and conclusions of the 9th such meeting, held during 9-10 October 2018, and appeared in the December 2019 issue of ACM SIGMOD Record.

Athanassios I. Hatzis, PhD
Software Engineer - Researcher, Consultant, Independent Contractor
