Table of Contents
In a decade of investigating NoSQL systems, I noticed a huge effort from many vendors to create SQL compatible APIs. Yes, there is a pretty good reason behind this tendency to walk-the-line as usual. Everybody has learned to program in SQL, everybody knows how to access a DBMS using SQL.
Well, I guess after 50 years of SQL based DBMS there has to be a better approach to win programmer’s acceptance but it seems nobody has managed to achieve such an ambitious goal in scale. Nobody dares to propose something different to be adopted as a better solution and even those few that tried something different, it just comes to my mind the Apache Tinkpop-Gremlin language, GraphQL and Cypher, those brave hearts have not managed to convince the majority of IT users to switch gears to their proposed query environment. Others were more radical in their approach and even imagined a connected semantic web, but when they designed their query language, SPARQL, it looked very similar to SQL, and even its acronym reminds us SQL.
I can start writing many reasons for why all these approaches have not reached the consensus for the majority of IT technical users. I could also state a number of facts for why the industry and several consortium keep using SQL as their favourite DBMS query language. But with this article I would like you to focus on a very specific point, and with that perhaps I can share a bit of my imagination on how it can be possible to access a DBMS differently, with more flexibility than SQL, without losing the comfort of your programming language, without the hurdles in SQL and most important with earning the interest and respect of developers.
Two programming paradigm approaches for a NoSQL API
What is SQL ? It’s a domain specific language designed specifically to manage data in a relational DBMS, although many proponents of relational theory correctly argue that these DBMS do not follow Codd’s relational principles. But that is another important story to tell some other time.
Recently, in the last decade, graph databases started becoming very popular but interestingly many important vendors stretched SQL language to cover graph traversal and other related operations. Columnar databases another very successful NoSQL type also followed the same tactic regarding to the query language. Therefore, although the data model and the physical layer implementation can vary a lot, SQL attempts to create an artificial unity at the database client level.
But again SQL is domain specific, started as data model specific and remains a declarative query language. The main question is what happens at the API side, i.e. at the side of the programming language and that is exactly where all interesting, peculiar things happen. How exactly do you connect to the DBMS server, what is the protocol DBMS driver uses to transfer data, how well objects match with entities and attributes in your DBMS, schema vs object model changes, nullable and partial loading of fields-atrributes, persistence, state, concurrency and caching of objects. For OOP aficionados and others what I briefly described is known as the object-relational impedance mismatch problem, the Vietnam of Computer Science. And do not make a mistake it is still “Vietnam” out there.
Chainable operators and operands OOP is perhaps the most popular programming paradigm and yet because of the SQL magic spell everything had to be squeezed on a tabular form using one ORM or the other. But during all these years of my personal research and development effort I noticed one specific characteristic of these ORM APIs that is related to the data retrieval approach in use. There are three major trends here, Query-By-Example (QBE) the precursor of GraphQL, Query-By-API (QBA) the precursor of Gremlin and Query-By-Language (QBL) where all the SQL like query languages are categorised, GQL is not an exception. Developer’s practical experience showed that certain complex queries, particularly those with joins, were more difficult to represent in QBA and QBE and once more SQL won again this battle. But it is not a surprise that proponents of graph database technology have only partially touched the object-relational impedance mismatch. It requires a deeper architectural design and multi-perspective attitude than only comparing how well nodes of the graph match objects of a class.
So, was that a fair battle between QBA and QBL, the answer is no because of the SQL existing dominance in DBMS and the mimicking of SQL join operations. That is where imagination enters the game, imagine that you could use object chainable operations using the fluent interface design method, a.k.a QBA, without the mismatch between the DBMS and OOP, i.e. imagine a loose connection between the DBMS and API, think of API database client more like a bridge to fetch or transfer data. The method of chaining query operators is very promising indeed, it can won a second round of boxing with SQL. I will briefly explain why.
OOP is based on the concept of objects that can exchange messages and modify their internal state. Therefore that kind of fluent interface is very natural in OOP languages, on the other hand most complex queries can be visualised and processed as a data pipeline. My question to those that design GQL, the next international query language standard, is why you keep following a dead-end declarative approach ? Wouldn’t it be more natural and appealing to the users to follow a data pipeline methodology by standardising operands and operations ? There is plenty of hard evidence to verify that this is not an assumption. Recall TinkerPop-Gremlin reference above and in Python there is Pandas, a very popular data analysis library (23621 starts at GitHub).
Speaking about programming languages, what is the other very competitive paradigm in software development. Arguably it is the functional one and yet the functional requirements for such an API are more obscure. But such an API exists in one of the most powerful functional programming languages of IT industry, Mathematica. See for example how carefully they designed Wolfram language functions on Database-Like Operations on Datasets and Computation with Structured Datasets and how perfectly these fit with many other functions of the same language.
A glimpse from the future
Enough said, software engineering is not a theoretical thing, it is based on practice. I will give you an example. One of the drawbacks of SQL is that you don’t have a direct access to manage data dictionary information, this is handled more or less automatically by the DBMS. Gartner recently reported that data quality and data integration tools are incorporating data catalogs. This is certainly a step towards the right direction, you need a query API that is capable to manage both data and metadata using a unified approach.
Consider the following code snippet in Python:
The first command demonstrates fetching metadata using OOP chainable python methods:
mis.get(221) .fields .over(select='nid, dim4, dim3, dim2, cname, alias, ntype, ctype, counter') .to_dataframe(index='dim4, dim3, dim2') .out() Out: nid cname alias ntype ctype counter dim4 dim3 dim2 1 605 7 227 Duration NA FLD CSV 0 8 228 Start date NA FLD CSV 0 9 229 End date NA FLD CSV 0 10 230 Start station number NA FLD CSV 0 11 231 Start station NA FLD CSV 0 12 232 End station number NA FLD CSV 0 13 233 End station NA FLD CSV 0 14 234 Bike number NA FLD CSV 0 15 235 Member type NA FLD CSV 0
The second command demonstrates fetching metadata using a functional approach:
mis.get(nid=221, what='fields', select='nid, dim4, dim3, dim2, cname, alias, ntype, ctype, counter', index='dim4, dim3, dim2', out='dataframe') Out: nid cname alias ntype ctype counter dim4 dim3 dim2 1 605 7 227 Duration NA FLD CSV 0 8 228 Start date NA FLD CSV 0 9 229 End date NA FLD CSV 0 10 230 Start station number NA FLD CSV 0 11 231 Start station NA FLD CSV 0 12 232 End station number NA FLD CSV 0 13 233 End station NA FLD CSV 0 14 234 Bike number NA FLD CSV 0 15 235 Member type NA FLD CSV 0
And the third command demonstrates fetching data using again a functional approach
mis.get(221, what='data', select='nid, dim4, dim3, dim2, cname, alias, ntype, ctype, counter', out='dataframe', limit=5, offset=1000) Out: 534 2012-04-01 10:29:06 2012-04-01 10:38:00 31113 Columbia Rd & Belmont St NW 31201 15th & P St NW W00663 Member 0 1937 2012-04-01 10:29:07 2012-04-01 11:01:25 31202 14th & R St NW 31621 4th & D St NW / Judiciary Square W00692 Casual 1 470 2012-04-01 10:29:17 2012-04-01 10:37:08 31104 Adams Mill & Columbia Rd NW 31200 Massachusetts Ave & Dupont Circle NW W00020 Member 2 727 2012-04-01 10:29:30 2012-04-01 10:41:38 31103 16th & Harvard St NW 31200 Massachusetts Ave & Dupont Circle NW W00880 Member 3 1144 2012-04-01 10:29:59 2012-04-01 10:49:03 31110 20th St & Florida Ave NW 31236 37th & O St NW / Georgetown University W00681 Casual 4 1698 2012-04-01 10:30:09 2012-04-01 10:58:27 31107 Lamont & Mt Pleasant NW 31610 Eastern Market / 7th & North Carolina Ave SE W01154 Member
I am sure it is not difficult for those few out there to imagine and even implement a better solution than SQL provided you have the motive and financial backup. In fact as I mentioned in this article that partly exists. But it is certainly far more challenging and difficult to loose the safety of your daily job and/or income for years to imagine and consequently strive to give flesh and bones to a truly NoSQL API that can be more productive, efficient and innovative to work with than what already exists. Keep an eye for the next release of TRIADB project and who knows you might be convinced that this is truly a unique and valuable tool to use.
Therefore, we need to investigate systems that combine relational algebra and linear algebra in a richer query paradigm, potentially as extensions to SQL. Two “holy grails” should continue to stay on our agenda. First, we must always explore any novel ideas to reduce the impedance mismatch between application development and writing database queries. Second, we must continue to find ways to make database systems less rigid (e.g., flexible schema evolution) without significantly sacrificing their performance 1