Selecting a Datastore: Part 2, What’s a Deal Maker, Deal Breaker, or No Big Deal


Predii’s Director of Engineering, Hieu Ho, is giving us an inside look at his recent journey selecting a new datastore for Predii. In Part 1 of his Selecting a Datastore series, Hieu discussed the top three reasons Predii chose the datastore we did. In Part 2, he delves into the complexities and quirks that can come along with any datastore, and how Predii determined which would work best for our application.



In evaluating software, intricacies and quirks start to emerge through experimentation and use of the features.  As with any relationship, you have to determine whether these quirks are deal breakers, deal makers, or no big deal.  This is not always easy to evaluate, so we wanted to share our experience with you.

Top 3 Quirks We Found When Selecting a Datastore:

1. Query Model and Language

The transition to Mongo and NoSQL from databases such as Oracle, MySQL, and Postgres took some getting used to. Acquiring data wasn’t a simple select-from-where statement, but a matter of accessing an API with JSON commands and setting the appropriate attributes. Take, for example, querying the values of a single column.  Where a transactional database will return a list of values for a select statement, in Mongo you ask the command-line interface (CLI) to perform a “find”, and then project the appropriate attribute.  Even then, the result is JSON, which must be parsed, usually with the built-in JavaScript.  For us, it was a matter of reusing queries that worked as templates, and using the client Java API in many of our applications.  The CLI was meant more for quick views and testing purposes.  Even our reports were wrapped in scripts (JavaScript, Bash, or Java-based), so we rarely had to write Mongo queries by hand. We decided the quirks we found with the query model and language were no big deal.
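To make the contrast concrete, here is a minimal sketch (the collection and field names are hypothetical, not from our actual schema) of what the same single-column lookup looks like in SQL versus the Mongo shell, and the extra JSON-handling step a result requires:

```javascript
// SQL: a single column comes back as a flat list of values:
//   SELECT part_number FROM parts WHERE make = 'Toyota';
//
// Mongo shell: the same lookup is a find() with a projection
// (collection and field names here are hypothetical):
//   db.parts.find({ make: "Toyota" }, { part_number: 1, _id: 0 })
//
// Either way, each Mongo result is a JSON document, not a bare value,
// so the attribute still has to be pulled out afterwards:
const results = [
  { part_number: "A-100" },
  { part_number: "B-200" },
];
const partNumbers = results.map(doc => doc.part_number);
console.log(partNumbers); // → [ 'A-100', 'B-200' ]
```

That last projection step is the kind of thing we baked into reusable query templates rather than retyping at the CLI.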


2. Indexes and Iterating Datasets

Mongo has built-in indexes, but building them took too long and they did not meet our needs.  As with any database, the indexes took up disk space, which grew wildly out of control.  Re-indexing was a pain point as well, as it would take over the system.  Whenever we needed to re-index, we would do so on a copy of the database, and then switch over.

As our indexing needs became more complicated, and our search requirements called for fine tuning and precise control, we opted to add an additional indexing layer with Apache Lucene.  This required us to create indexes outside the database, but at least it could be done incrementally and without locking the database.  Lucene allows for facet searches, a way to slice and dice the data into different views and angles, a plus for our client views.  Later, we transitioned to Elasticsearch, an engine built on top of Lucene that adds distributed nodes and clustered servers.  More on this topic to come. Almost a deal breaker, but we were able to find a workaround that suited our needs.
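Facet searching is easier to picture with a toy example. This is not the Lucene API, just a sketch of the idea behind it, using hypothetical documents: for each distinct value of a field, count the matching documents, so the same result set can be sliced from different angles:

```javascript
// Hypothetical documents; in our setup these lived in an external
// index built incrementally, outside the database.
const docs = [
  { make: "Toyota", year: 2012 },
  { make: "Toyota", year: 2014 },
  { make: "Honda",  year: 2012 },
];

// A facet over a field is essentially a count of documents
// per distinct value of that field.
function facet(documents, field) {
  const counts = {};
  for (const doc of documents) {
    const key = doc[field];
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

console.log(facet(docs, "make")); // → { Toyota: 2, Honda: 1 }
console.log(facet(docs, "year")); // → { '2012': 2, '2014': 1 }
```

Lucene (and later Elasticsearch aggregations) computes these buckets inside the index, so the client views can pivot on any faceted field without re-querying the database.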


3. Replica Sets

In any highly available system, having a duplicate copy of the data means that, should the master node fail, you can be up and running in no time.  Replica sets in Mongo work similarly, in that the data is duplicated onto the slaves of the replica set.  This is done by actively checking the operation log for update commands on the master node, and replaying the same commands on the replica set.  For our systems, which perform large batch processing, this caused the replica nodes to stall, as the commands executed on them essentially doubled the time the updates had taken on the original node.  And the whole “eventual consistency” of querying a replica node while an update was in progress gave undesired results.

Our system did not require a live database, though, so eventually there was no need for a replica set.  Real-time results were not a requirement, so a read-only copy served well.  In fact, all the client-facing data came from read-only nodes.  By setting up a separate database set for processing, the client data is not impacted.  When processing finished, the client-facing hosts would hot-swap to the new data, while allowing us to replicate in the backend.  Synchronizing the replica sets turned out to be not such a big deal.
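The hot-swap pattern above can be sketched roughly like this (the names are illustrative, not our actual code): client reads always go through a reference to the current read-only copy, batch processing builds the next copy on the side, and the reference is switched only once processing finishes:

```javascript
// Two copies of the data: one serving clients read-only,
// one being rebuilt by batch processing in the backend.
let liveData = { version: 1, records: ["a", "b"] };

function serveClient() {
  // Client-facing reads never touch the copy being processed.
  return liveData;
}

function runBatchProcessing(oldData) {
  // Build the next version on a separate copy; the live copy
  // is untouched, so clients see no stalls or partial updates.
  return { version: oldData.version + 1, records: [...oldData.records, "c"] };
}

const next = runBatchProcessing(liveData);
liveData = next; // the "hot swap": a single reference switch

console.log(serveClient().version); // → 2
```

Because clients only ever see a fully built copy, the eventual-consistency surprises of querying a mid-replication node simply don't arise.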


As with any “out-of-the-box” solution, there are various challenges you will discover along the way, as well as lessons and best practices for each.  We found that mixing and matching solutions to fit our requirements served us best.  We hope you are successful in your endeavors, and happy coding!

Keep an eye out for Part 3 of Hieu's Selecting a Datastore series, which will discuss distributed architecture.