Thursday, July 9, 2009

Choosing between SQL and Non-SQL

There has recently been a lot of discussion about various Non-SQL models, which relax data consistency in exchange for high scalability. A recent meetup in San Francisco was devoted to this topic.

Various models were discussed, from Memcached and distributed hashtables to column-oriented data stores, each with its corresponding data access model. Many of the Non-SQL stores focus on scalability in two ways (see the partitioning sketch after this list):
  1. Partition data across lots of machines
  2. Have each machine operate independently (i.e., with minimal coordination between machines)
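
To make point 1 concrete, here is a minimal consistent-hashing sketch in Python. Consistent hashing is one common partitioning technique (not specific to any particular store mentioned here), and the node names, replica count, and MD5 hash below are illustrative assumptions.

    import bisect
    import hashlib

    # Minimal consistent-hash ring: keys are spread across nodes so that
    # adding or removing a node only remaps a small fraction of the keys.
    class ConsistentHashRing:
        def __init__(self, nodes, replicas=100):
            self.replicas = replicas      # virtual nodes per physical node
            self.ring = {}                # hash position -> node name
            self.sorted_hashes = []
            for node in nodes:
                self.add_node(node)

        def _hash(self, key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def add_node(self, node):
            for i in range(self.replicas):
                h = self._hash("%s:%d" % (node, i))
                self.ring[h] = node
                bisect.insort(self.sorted_hashes, h)

        def node_for(self, key):
            # Walk clockwise to the first virtual node at or after the
            # key's hash position, wrapping around the end of the ring.
            idx = bisect.bisect(self.sorted_hashes, self._hash(key))
            return self.ring[self.sorted_hashes[idx % len(self.sorted_hashes)]]

    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:1234"))   # e.g. "node-b"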

In such a distributed environment, providing tight data integrity is a huge challenge, so some degree of consistency is given up (though exactly what is given up differs between implementations). I seriously doubt it is feasible to add data partitioning to a traditional RDBMS without weakening its data integrity guarantees.

From a data access pattern perspective, Memcached represents one end of the spectrum: access is ad hoc, touches a small subset of a large data set, and user-to-data collocation is completely unpredictable. This presents the biggest challenge to ACID guarantees. One could theoretically use 2PC or Paxos protocols, but in practice no one does because they would be too slow. Whenever Memcached is used, the application must therefore have a fairly relaxed data consistency requirement.
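
For illustration, here is a minimal cache-aside sketch using the python-memcached client. The db.load_user / db.save_user calls, the key scheme, and the 20-minute TTL are hypothetical placeholders; the comment in update_user marks where the relaxed consistency shows up.

    import memcache   # python-memcached; assumes a memcached server on localhost

    cache = memcache.Client(["127.0.0.1:11211"])
    TTL = 20 * 60     # cache entries expire after 20 minutes

    def get_user(user_id, db):
        # Cache-aside read: try the cache first, fall back to the database.
        key = "user:%s" % user_id
        user = cache.get(key)
        if user is None:
            user = db.load_user(user_id)      # hypothetical DB call
            cache.set(key, user, time=TTL)
        return user

    def update_user(user_id, user, db):
        # After a write, other processes holding a stale cached copy may keep
        # seeing the old value until it expires: exactly the relaxed
        # consistency the application must be prepared to tolerate.
        db.save_user(user_id, user)           # hypothetical DB call
        cache.set("user:%s" % user_id, user, time=TTL)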

At the other end of the spectrum, if we can "confine" the data access model, then we can build a highly specialized data layout plan, and with a careful execution plan we can even eliminate concurrency issues. Under this approach, very high scalability can be achieved. Map/Reduce is a typical example. The challenge is that an extra algorithm translation step is required to rearrange the application to fit the confined Map/Reduce model.
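
As a toy illustration of that translation step, here is word counting rearranged into the Map/Reduce shape in Python (the input documents are made up). Each map task runs independently on its slice of the data, and the reduce only needs the grouped intermediate pairs, which is what makes coordination-free partitioned execution possible.

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        # Map: emit (word, 1) pairs; each document can be processed
        # independently on a separate machine.
        return [(word, 1) for word in document.split()]

    def reduce_phase(word, counts):
        # Reduce: combine all counts for one word.
        return word, sum(counts)

    def map_reduce(documents):
        groups = defaultdict(list)
        for key, value in chain.from_iterable(map_phase(d) for d in documents):
            groups[key].append(value)         # shuffle: group values by key
        return dict(reduce_phase(k, v) for k, v in groups.items())

    print(map_reduce(["a rose is a rose", "to be or not to be"]))
    # {'a': 2, 'rose': 2, 'is': 1, 'to': 2, 'be': 2, 'or': 1, 'not': 1}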

Nevertheless, I don't think it is very meaningful to compare different DB models without looking at several other dimensions, which this discussion hasn't covered.

What are the data access patterns?
  • Batch-oriented, requiring a scan of the whole data set (e.g., Map/Reduce style)
  • Ad hoc, unpredictable access (with concurrent access) to a small, unknown subset of the data
  • Lookup by an explicitly provided key within a collection
  • Value-based lookup (criteria expressed as a boolean expression) within a collection; a sketch contrasting this with key-based lookup follows the list
  • How criteria are specified (equality, greater-than/less-than, prefix match, similarity match)
  • Inner/outer joins across multiple collections
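
As a toy contrast between the two lookup styles (the data and the predicate are made up): key-based lookup follows a direct access path, while value-based lookup without a secondary index must scan every record in the collection.

    # An in-memory "collection" standing in for a key/value store.
    users = {
        "u1": {"name": "alice", "age": 34},
        "u2": {"name": "bob", "age": 27},
    }

    # Key-based lookup: the access path is fully determined by the key.
    print(users["u1"])

    # Value-based lookup with a boolean criterion: absent an index on
    # "age" or "name", the whole collection must be scanned.
    print([u for u in users.values()
           if u["age"] > 30 and u["name"].startswith("a")])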

The "layout structure" of data and corresponding index plays a big role.
  • key/value store
  • Relational store
  • Column-oriented store
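
A toy contrast of the row-oriented and column-oriented layouts for the same three records (the field names and values are made up). An aggregate over one field reads whole records in the row layout but a single array in the column layout, which is why column stores suit analytic scans.

    rows = [
        {"id": 1, "name": "alice", "age": 34},
        {"id": 2, "name": "bob", "age": 27},
        {"id": 3, "name": "carol", "age": 41},
    ]
    columns = {
        "id": [1, 2, 3],
        "name": ["alice", "bob", "carol"],
        "age": [34, 27, 41],
    }

    avg_row = sum(r["age"] for r in rows) / len(rows)      # touches every record
    avg_col = sum(columns["age"]) / len(columns["age"])    # touches one column
    assert avg_row == avg_col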

What is the data consistency requirement?
  • Everyone at any time must see the absolutely latest information
  • It is OK to see outdated information as long as everyone sees the same picture
  • It is OK to see outdated information if the staleness is time-bound, e.g., fresh within 20 minutes (sketched below)
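
Here is a minimal sketch of the third, time-bound requirement: a reader accepts a cached value only while it is fresh within the bound. The 20-minute bound matches the example above; load_from_replica is a hypothetical fetch function.

    import time

    MAX_STALENESS = 20 * 60          # seconds: "fresh within 20 minutes"
    cache = {}                       # key -> (value, fetch_timestamp)

    def read(key, load_from_replica):
        value, fetched_at = cache.get(key, (None, 0.0))
        if time.time() - fetched_at > MAX_STALENESS:
            # Too stale (or never fetched): refresh from a replica.
            value = load_from_replica(key)
            cache[key] = (value, time.time())
        return value

    print(read("user:1", lambda key: "value-from-replica"))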

The "data update" mechanism plays a big role
  • Isolation mechanism (LOCK, MVCC)
  • Update Co-oridination mechanism (2PC, PAXOS, Async Gossip)
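
To make the isolation side concrete, here is a toy MVCC sketch: writers append new versions under a logical timestamp instead of taking locks, and a reader sees the newest version no later than its snapshot, so reads never block writes. (A real MVCC engine also detects write conflicts and garbage-collects old versions; everything below is an illustrative simplification.)

    import itertools

    class MVCCStore:
        def __init__(self):
            self.versions = {}               # key -> list of (ts, value)
            self.clock = itertools.count(1)  # logical timestamp source

        def write(self, key, value):
            # Append a new version; older versions stay readable.
            self.versions.setdefault(key, []).append((next(self.clock), value))

        def snapshot(self):
            return next(self.clock)          # a reader's snapshot timestamp

        def read(self, key, snapshot_ts):
            # Latest version written at or before the snapshot.
            visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
            return visible[-1] if visible else None

    store = MVCCStore()
    store.write("x", "v1")
    snap = store.snapshot()
    store.write("x", "v2")                   # invisible to the older snapshot
    assert store.read("x", snap) == "v1"
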
More research is needed into these tradeoff decisions and how they affect scalability. It will take a while for the industry to figure out the best model for each type of application.
