Predii’s Director of Engineering, Hieu Ho, is giving us an inside look at his recent journey selecting a new datastore for Predii. In Part 1 of his Selecting a Datastore series, Hieu discussed the top three reasons Predii chose the datastore it did. In Part 2 of the series, he delves into the complexities and quirks that come along with any datastore and how Predii determined which would work best for our application.
Selecting a Datastore: Part 2, What’s a Deal Maker, Deal Breaker, or No Big Deal
In evaluating software, intricacies and quirks start to emerge through experimentation and use of the features. As with any relationship, you have to decide whether these quirks are deal breakers, deal makers, or no big deal. That is not always easy to evaluate, so we wanted to share our experience with you.
Top 3 Quirks We Found When Selecting a Datastore:
1. Query Model and Language
2. Indexes and Iterating Datasets
Mongo has built-in indexes, but building them took too long and they did not meet our needs. As with any database, the indexes consumed disk space, which grew wildly out of control. Re-indexing was a pain point as well, since it monopolized the system. Whenever we needed to re-index, we would do so on a copy of the database and then switch over.
As our indexing needs became more complicated, and our search requirements demanded fine tuning and precise control, we opted to add an additional indexing layer with Apache Lucene. This required us to build indexes outside of the database, but at least it could be done incrementally and without locking the database. Lucene also allows for facet searches, a way to slice and dice the data from different views and angles, which was a plus for our client views. Later, we transitioned to Elasticsearch, an engine built on top of Lucene that adds distributed nodes and server clustering. More on this topic to come. Almost a deal breaker, but we were able to find a workaround that suited our needs.
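To make the incremental, non-locking pattern concrete, here is a toy sketch of an external index with facet counts in plain Python. The names and the automotive sample data are purely illustrative; the real Lucene and Elasticsearch APIs look nothing like this, but the shape of the idea is the same: documents are indexed outside the database, one at a time, so the database itself is never locked for a full re-index.

```python
from collections import defaultdict

class ExternalIndex:
    """Toy inverted index maintained outside the database.

    Documents are added incrementally, so there is no big-bang
    re-index that locks anything.
    """
    def __init__(self):
        self.postings = defaultdict(set)                      # term -> doc ids
        self.facets = defaultdict(lambda: defaultdict(int))   # field -> value -> count

    def add(self, doc_id, text, facets):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        for field, value in facets.items():
            self.facets[field][value] += 1

    def search(self, term):
        return self.postings.get(term.lower(), set())

    def facet_counts(self, field):
        # Facets let clients slice the same data by different fields.
        return dict(self.facets[field])

index = ExternalIndex()
index.add(1, "brake pad replacement", {"make": "Toyota"})
index.add(2, "brake fluid flush", {"make": "Honda"})
index.add(3, "oil change", {"make": "Toyota"})  # added later, incrementally

print(index.search("brake"))        # {1, 2}
print(index.facet_counts("make"))   # {'Toyota': 2, 'Honda': 1}
```

The point of the sketch is the shape of the trade-off: the index lives outside the database, so keeping it fresh is your job, but updating it never blocks the primary datastore.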
3. Replica Sets
In any highly available system, having a duplicate copy of the data means that, should the master node fail, you can be back up and running in no time. Replica sets in Mongo work similarly, in that the data is duplicated onto the slaves of the replica set. This is done by actively reading the operation log for update commands on the master node and performing the same commands on the replica nodes. For our systems, which perform large batch processing, this caused the replica nodes to stall, as the commands executed on them took roughly as long as they did on the original node. And the whole “eventual consistency” of querying a replica node while an update was in progress gave undesired results.
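A minimal simulation of that oplog-replay behavior shows why a replica lags and serves stale reads during a heavy batch. This is a sketch with made-up class names, not the Mongo wire protocol; the one assumption it encodes is that a replica must re-execute every write it replays, so a large batch costs roughly the same on the replica as on the master.

```python
class Master:
    def __init__(self):
        self.data = {}
        self.oplog = []  # every write is recorded for replicas to replay

    def update(self, key, value):
        self.data[key] = value
        self.oplog.append((key, value))

class Replica:
    def __init__(self, master):
        self.data = {}
        self.master = master
        self.applied = 0  # position in the master's oplog

    def replay(self, max_ops):
        # Replaying re-executes each write one by one, which is why a
        # big batch on the master stalls the replica for a similar time.
        for key, value in self.master.oplog[self.applied:self.applied + max_ops]:
            self.data[key] = value
            self.applied += 1

master = Master()
replica = Replica(master)

# Large batch on the master...
for i in range(1000):
    master.update(f"part:{i}", {"status": "processed"})

# ...but the replica has only caught up partway: eventual consistency.
replica.replay(max_ops=400)
print(replica.data.get("part:999"))  # None -- a stale read mid-replication
```

Querying the replica at this moment returns `None` for documents the master already has, which is exactly the kind of undesired result we saw when reads landed on a node that was still replaying a batch.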
Our system did not require a live database, though, so ultimately there was no need for a replica set. Real-time results were not a requirement, so a read-only copy served us well; in fact, all the client-facing data came from read-only nodes. By setting up a separate database for processing, the client data is not impacted. When processing finished, the client-facing hosts would hot-swap to the new data, while allowing us to replicate in the backend. Synchronizing the replica sets became not such a big deal.
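The hot-swap itself can be as simple as an atomic pointer flip from the client-facing snapshot to the freshly processed copy. The sketch below uses hypothetical names and an in-memory dict standing in for a database host, but it captures the pattern: clients only ever read a finished snapshot, batch work happens on a staging copy, and the switch is one cheap operation.

```python
import threading

class HotSwapStore:
    """Client reads always hit a read-only snapshot; batch processing
    fills a staging copy, which is then swapped in atomically."""
    def __init__(self, initial):
        self._lock = threading.Lock()
        self._live = initial

    def read(self, key):
        return self._live.get(key)   # clients never see the staging copy

    def swap(self, staging):
        with self._lock:
            self._live = staging     # O(1) flip once processing is done

store = HotSwapStore({"widget": {"v": 1}})
print(store.read("widget"))  # {'v': 1}

# Heavy batch processing happens entirely on a separate copy...
staging = {"widget": {"v": 2}, "gadget": {"v": 1}}

# ...then the client-facing data flips over in one step.
store.swap(staging)
print(store.read("gadget"))  # {'v': 1}
```

Because clients never touch the staging copy, there is no window where they can observe half-processed data, which is what made eventual consistency a non-issue for us.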
As with any “out-of-the-box” solution, there are various challenges you will discover along the way, as well as lessons and best practices for each. We found that mixing and matching solutions to meet our requirements served us best. We hope you are successful in your endeavors, and happy coding!
Keep an eye out for Part 3 of Hieu's Selecting a Datastore series, which will discuss distributed architecture.