Friday, 6 May 2016

Polyglot Persistence in NoSQL Space






by Sarang Nagmote



Category - Databases
More Information & Updates Available at: http://insightanalytics.co.in




Relational databases have been around for a long time; developers reach for them often and know their feature set well. Knowing SQL is enough to use them. Relational databases also do a great job of hiding their internals from users: questions such as how data is laid out on disk, how the write path actually works under the hood, and whether and how the database caches are usually beyond what everyday usage requires.
We live in a data-intensive world. In the last couple of years we have witnessed an explosion of social networks and IoT: everything is connected to the internet, and everything emits and receives events. That volume of data has to be handled differently than in the relational world. We see more and more requests to store unstructured data simply for the sake of storing it, with the idea that value will be extracted from it later on. As of 2012, 2.5 exabytes (2.5×10¹⁸ bytes) of data were being created every day, and 90% of the data in the world had been gathered in the previous two years. Those are just some of the reasons why we need a shift in thinking about storage and why NoSQL is no longer hype.
In the NoSQL world, things are a bit different. Databases are not general-purpose; they are usually built to solve a specific use case. They work great for that use case but not so well for other problems. They are often the choice for big data projects with demanding non-functional requirements, and it is almost impossible to work with NoSQL databases without knowing their internals.

NoSQL Space

There are four major types of databases in the NoSQL space: document databases, key-value stores, column-family stores, and graph databases.
Document databases are probably the most popular type, mainly because of MongoDB, which was one of the first NoSQL databases to enter the top 10 storage engines in the DB-Engines ranking. MongoDB is rich in features, has master/slave replication with built-in sharding, and uses memory-mapped files for data storage. It is great for prototyping because of the flexibility it offers when storing and querying data, and it is a good substitute in most places where PostgreSQL or another RDBMS would fit but a predefined schema holds you back. Other databases of this type are CouchDB, a document store with master-master replication and eventual consistency, and Couchbase, which is both a key-value and a document store and provides a JSON API that makes it a useful storage engine for client-server apps.
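To make that flexibility concrete, here is a minimal sketch using the pymongo driver against a local MongoDB instance; the database, collection, and field names are invented for the example. Documents with different shapes can sit in the same collection and still be queried together:

# A minimal, illustrative sketch of MongoDB's schema flexibility (pymongo).
# Assumes a local mongod instance; all names below are made up for the example.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Two documents with different shapes live in the same collection.
products.insert_one({"name": "Laptop", "price": 999, "specs": {"ram_gb": 16}})
products.insert_one({"name": "Mug", "price": 7, "tags": ["kitchen", "gift"]})

# Query on a nested field with no predefined schema required.
for doc in products.find({"specs.ram_gb": {"$gte": 8}}):
    print(doc["name"], doc["price"])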
When you think of a key-value store, think of a distributed implementation of a Map. Redis and Hazelcast are databases of this type. Redis is a blazingly fast database written in C. It has master-slave replication with automatic failover. It is the best choice for rapidly changing data with a foreseeable database size (it should fit mostly in memory); examples of usage are session storage, shopping carts for e-commerce sites, or real-time analytics. Hazelcast is an in-memory distributed storage and computing platform offering out-of-the-box distributed implementations of many APIs Java developers are familiar with, such as Map, Set, List, Semaphore, and Executor; it can be seen as a drop-in replacement for tools like Ehcache, Redis, or JCache.
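To make the distributed Map analogy concrete, here is a minimal session-storage sketch using the redis-py client against a local Redis server; the key layout and the 30-minute TTL are arbitrary choices for the example:

# A minimal, illustrative sketch of session storage in Redis (redis-py).
# Assumes a local Redis server; key names and TTL are made up for the example.
import redis

r = redis.Redis(host="localhost", port=6379)

# Keep session attributes in a hash and let the whole key expire in 30 minutes.
session_key = "session:42"
r.hset(session_key, mapping={"user_id": "42", "cart_items": "3"})
r.expire(session_key, 1800)

# Reads are plain key lookups, which is why this fits fast, short-lived data.
print(r.hgetall(session_key))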
Column-family databases store columns of related data grouped together, with values accessed by key. The most popular databases of this type are Cassandra and HBase. Cassandra combines ideas from Amazon's Dynamo and Google's Bigtable and is great for storing huge amounts of data. All nodes are equal: it is a shared-nothing, masterless architecture with no single point of failure. CQL (Cassandra Query Language) is similar to SQL. It is best used for web analytics, hit counting, transaction logging, and storing sensor data. HBase is part of the Hadoop ecosystem and uses the HDFS file system as its storage. It is best used with Hadoop's MapReduce, for analyzing log data, and anywhere scanning huge, two-dimensional, join-less tables is a requirement.
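Because CQL reads much like SQL, a sensor-data table is easy to sketch. The following example uses the DataStax Python driver against a local single-node cluster; the keyspace, table, and replication settings are invented for illustration:

# A minimal, illustrative sketch of time-ordered sensor data in Cassandra
# (DataStax Python driver). Assumes a local single-node cluster; keyspace and
# table names are made up for the example.
import datetime
import uuid

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.sensor_readings (
        sensor_id uuid,
        reading_time timestamp,
        value double,
        PRIMARY KEY (sensor_id, reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")

# One partition per sensor; rows inside a partition are ordered by time.
sensor_id = uuid.uuid4()
session.execute(
    "INSERT INTO metrics.sensor_readings (sensor_id, reading_time, value) "
    "VALUES (%s, %s, %s)",
    (sensor_id, datetime.datetime.utcnow(), 21.5),
)

for row in session.execute(
    "SELECT reading_time, value FROM metrics.sensor_readings "
    "WHERE sensor_id = %s LIMIT 10",
    (sensor_id,),
):
    print(row.reading_time, row.value)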
In graph databases, the relations between entities are more important than the entities themselves. The main representatives of this type are Neo4j and TitanDB. Neo4j is a graph database written in Java; it uses the pattern-matching query language Cypher, but can use the Gremlin graph traversal language as well. It is best used for graph-style, rich or complex, interconnected data such as finding routes in social relations, public transport links, road maps, or network topologies. TitanDB is a scalable graph framework optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. It can use various storage backends (Cassandra, HBase, BerkeleyDB) and integrates well with mainstream frameworks such as Hadoop, Spark, Elasticsearch, and Solr.
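The sketch below shows the pattern-matching style Cypher encourages, using the official Neo4j Python driver against a local instance; the Person label, FRIEND relationship, and credentials are invented for the example (a simple friend-of-a-friend recommendation):

# A minimal, illustrative Cypher query via the Neo4j Python driver.
# Assumes a local Neo4j instance; labels, properties, and credentials are
# made up for the example.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Friends of friends that Alice does not already know: a typical
    # recommendation-style traversal.
    result = session.run(
        """
        MATCH (me:Person {name: $name})-[:FRIEND]->(:Person)-[:FRIEND]->(candidate:Person)
        WHERE NOT (me)-[:FRIEND]->(candidate) AND candidate <> me
        RETURN candidate.name AS name, count(*) AS mutual_friends
        ORDER BY mutual_friends DESC
        """,
        name="Alice",
    )
    for record in result:
        print(record["name"], record["mutual_friends"])

driver.close()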

Polyglot Persistence Approach

It is not uncommon these days to see a combination of relational and non-relational databases within the same project, or even several different databases of the same type. Microservice architecture has influenced this a lot: each small service owns its data, everything sits behind an API, and services are split by use case, so it is only natural to choose the best storage for each job.
[Diagram: Polyglot persistence - Martin Fowler]
Here is an example of an application with multiple storage engines, taken from Martin Fowler's blog. The user session, as temporary data optimized for frequent reads, is placed in Redis. Financial data, which is naturally relational, is stored in an RDBMS together with reports. The shopping cart is similar in nature to the session, temporary fast-access data, so it is stored in a key-value store (Riak this time). Neo4j, as a graph database, has great support for recommendation engines; user similarities are an important factor in making recommendations to customers, and that is the prime reason for choosing it. The product catalog has the nature of a document and there are a lot of dynamic criteria for querying it, so MongoDB, with its flexibility and speed, is a great fit for this use case. Cassandra's killer feature is time-series data, and it integrates well with Spark and Solr, so it is the best choice for analytics and user activity logs.
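As a rough illustration of how one request path can touch several of these stores, here is a hypothetical checkout sketch: the cart lives in a key-value store (Redis standing in for Riak from the diagram), product details come from the MongoDB catalog, and the finished order is written to a relational database (SQLite standing in for the RDBMS). All key names, collections, and schemas are made up for the example:

# A hypothetical polyglot checkout flow: cart in a key-value store, catalog in
# a document store, completed orders in a relational database. Every name and
# schema below is invented for illustration.
import json
import sqlite3

import redis
from pymongo import MongoClient

kv = redis.Redis()                            # fast, temporary cart data
catalog = MongoClient()["shop"]["products"]   # flexible product documents
orders_db = sqlite3.connect("orders.db")      # durable, relational order records
orders_db.execute(
    "CREATE TABLE IF NOT EXISTS orders (user_id TEXT, sku TEXT, qty INTEGER, price REAL)"
)

def checkout(user_id):
    # 1. Read the cart (stored as a JSON blob per user) from the key-value store.
    cart = json.loads(kv.get("cart:" + user_id) or b"[]")

    # 2. Enrich each line with the current price from the document store and
    #    record it as a relational order row.
    for item in cart:
        product = catalog.find_one({"sku": item["sku"]}) or {}
        orders_db.execute(
            "INSERT INTO orders VALUES (?, ?, ?, ?)",
            (user_id, item["sku"], item["qty"], product.get("price", 0.0)),
        )

    # 3. Commit the relational write, then drop the temporary cart.
    orders_db.commit()
    kv.delete("cart:" + user_id)

checkout("42")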

Conclusion

Polyglot persistence is not a free ride; it comes with a price, and the price is the complexity of the system. A couple of years ago it was enough to know one programming language and one storage engine to build a system, but nowadays you need to be a polyglot in every respect. That is why you need to be familiar with the cost of each choice and introduce a new technology only when its benefits clearly outweigh the complexity it brings to the table. Now that everybody talks about the polyglot approach, the scariest thing is adopting a new technology just for the sake of it.
Because NoSQL databases are built to solve specific use cases, be prepared to dig a bit deeper and learn how a database works underneath. The problems of big data and distributed systems are very different from single-instance problems, and being unfamiliar with the technology can lead to problems at a much bigger scale.
Do your homework, explore the possible solutions to your problem, make a decision, and get familiar with the tool you choose.
