DBMS Musings: If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

Wednesday, May 16, 2012

If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

(This post is coauthored by Alexander Thomson and Daniel Abadi)
In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential---some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let’s ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Go to the TPC-C Website with the top 10 results in raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?

The surprising truth is that these new DBMS technologies are not listed in the TPC-C top ten results not because that they do not care enough to enter, but rather because they would not win if they did.

To understand why this is the case, one must understand that scalability does not come for free. Something must be sacrificed to achieve high scalability. Today, there are three major categories of tradeoff that can be exploited to make a system scale. The new technologies basically fall into two of these categories; Oracle and DB2 fall into a third. And the later parts of this blog post describes research from our group at Yale that introduces a fourth category of tradeoff that provides a roadmap to end the dominance of Oracle and DB2.

These categories are:

(1) Sacrifice ACID for scalability. Our previous post on this topic discussed this in detail. Basically we argue that a major class of new scalable technologies fall under the category of “NoSQL” which achieves scalability by dropping ACID guarantees, thereby allowing them to eschew two phase locking, two phase commit, and other impediments to concurrency and processor independence that hurt scalability. All of these systems that relax ACID are immediately ineligible to enter the TPC-C competition since ACID guarantees are one of TPC-C’s requirements. That’s why you don’t see NoSQL databases in the TPC-C top 10---they are immediately disqualified.

(2) Reduce transaction flexibility for scalability. There are many so-called “NewSQL” databases that claim to be both ACID-compliant and scalable. And these claims are true---to a degree. However, the fine print is that they are only linearly scalable when transactions can be completely isolated to a single “partition” or “shard” of data. While these NewSQL databases often hide the complexity of sharding from the application developer, they still rely on the shards to be fairly independent. As soon as a transaction needs to span multiple shards (e.g., update two different user records on two different shards in the same atomic transaction), then these NewSQL systems all run into problems. Some simply reject such transactions. Others allow them, but need to perform two phase commit or other agreement protocols in order to ensure ACID compliance (since each shard may fail independently). Unfortunately, agreement protocols such as two phase commit come at a great scalability cost (see our 2010 paper that explains why). Therefore, NewSQL databases only scale well if multi-shard transactions (also called “distributed transactions” or “multi-partition transactions”) are very rare. Unfortunately for these databases, TPC-C models a fairly reasonable retail application where customers buy products and the inventory needs to be updated in the same atomic transaction. 10% of TPC-C New Order transactions involve customers buying products from a “remote” warehouse, which is generally stored in a separate shard. Therefore, even for basic applications like TPC-C, NewSQL databases lose their scalability advantages. That’s why the NewSQL databases do not enter TPC-C results --- even just 10% of multi-shard transactions causes their performance to degrade rapidly.

(3) Trade cost for scalability. If you use high end hardware, it is possible to get stunningly high transactional throughput using old database technologies that don’t have shared-nothing horizontally scalability. Oracle tops TPC-C with an incredibly high throughput of 500,000 transactions per second. There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions---machine-generated transactions are a different story). Therefore, Oracle basically has all the scalability that is needed for human scale applications. The only downside is cost---the Oracle system that is able to achieve 500,000 transactions per second costs a prohibitive $30,000,000!

Since the first two types of tradeoffs are immediate disqualifiers for TPC-C, the only remaining thing to give up is cost-for-scale, and that’s why the old database technologies are still dominating TPC-C. None of these new technologies can handle both ACID and 10% remote transactions.

A fourth approach...

TPC-C is a very reasonable application. New technologies should be able to handle it. Therefore, at Yale we set out to find a new dimension in this tradeoff space that could allow a system to handle TPC-C at scale without costing $30,000,000. Indeed, we are presenting a paper next week at SIGMOD (see the full paper) that describes a system that can achieve 500,000 ACID-compliant TPC-C New Order transactions per second using commodity hardware in the cloud. The cost to us to run these experiments was less than $300 (of course, this is renting hardware rather than buying, so it’s hard to compare prices --- but still --- a factor of 100,000 less than $30,000,000 is quite large).

Calvin, our prototype system designed and built by a large team of researchers at Yale that include Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal (in addition to the authors of this blog post), explores a tradeoff very different from the three described above. Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture. In other words, Calvin is designed to handle high-volume OLTP throughput on sharded databases on cheap, commodity hardware stored locally or in the cloud. Calvin significantly improves the scalability over our previous approach to achieving determinism in database systems.

Scaling ACID

The key to Calvin’s strong performance is that it reorganizes the transaction execution pipeline normally used in DBMSs according to the principle: do all the "hard" work before acquiring locks and beginning execution. In particular, Calvin moves the following stages to the front of the pipeline:

Replication. In traditional systems, replicas agree on each modification to database state only after some transaction has made the change at some "master" replica. In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute.
Agreement between participants in distributed transactions. Database systems traditionally use two-phase commit (2PC) to handle distributed transactions. In Calvin, every node sees the same global sequence of transaction requests, and is able to use this already-agreed-upon information in place of a commit protocol.
Disk accesses. In our VLDB 2010 paper, we observed that deterministic systems performed terribly in disk-based environments due to holding locks for the 10ms+ duration of reading the needed data from disk, since they cannot reorder conflicting transactions on the fly. Calvin gets around this setback by prefetching into memory all records that a transaction will need during the replication phase---before locks are even acquired.

As a result, each transaction’s user-specified logic can be executed at each shard with an absolute minimum of runtime synchronization between shards or replicas to slow it down, even if the transaction’s logic requires it to access records at multiple shards. By minimizing the time that locks are held, concurrency can be greatly increased, thereby leading to near-linear scalability on a commodity cluster of machines.

Strongly consistent global replication

Calvin’s deterministic execution semantics provide an additional benefit: replicating transactional input is sufficient to achieve strongly consistent replication. Since replicating batches of transaction requests is extremely inexpensive and happens before the transactions acquire locks and begin executing, Calvin’s transactional throughput capacity does not depend at all on its replication configuration.

In other words, not only can Calvin can run 500,000 transactions per second on 100 EC2 instances in Amazon’s US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon’s Europe (Ireland) and US West (California) data centers---at no cost to throughput.

Calvin accomplishes this by having replicas perform the actual processing of transactions completely independently of one another, maintaining strong consistency without having to constantly synchronize transaction results between replicas. (Calvin’s end-to-end transaction latency does depend on message delays between replicas, of course---there is no getting around the speed of light.)

Flexible data model

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access). The experiments presented in the paper use a custom key-value store. More recently, we’ve hooked Calvin up to Google’s LevelDB and added support for SQL-based data access within transactions, building relational tables on top of LevelDB’s efficient sorted-string storage.

From an application developer’s point of view, Calvin’s primary limitation compared to other systems is that transactions must be executed entirely server-side. Calvin has to know in advance what code will be executed for a given transaction. Users may pre-define transactions directly in C++, or submit arbitrary Python code snippets on-the-fly to be parsed and executed as transactions.

For some applications, this requirement of completely server-side transactions might be a difficult limitation. However, many applications prefer to execute transaction code on the database server anyway (in the form of stored procedures), in order to avoid multiple round trip messages between the database server and application server in the middle of a transaction.

If this limitation is acceptable, Calvin presents a nice alternative in the tradeoff space to achieving high scalability without sacrificing ACID or multi-shard transactions. Hence, we believe that our SIGMOD paper may present a roadmap for overcoming the scalability dominance of the decades-old database solutions on traditional OLTP workloads. We look forward to debating the merits of this approach in the weeks ahead (and Alex will be presenting the paper at SIGMOD next week).

43 comments:

Boris LetochaMay 16, 2012 at 8:17 AM
Hi,

Thanks for nice write up, it helped me assure that I understand paper about Calvin correctly.

Now some fearless questions:
Will be Calvin Open Sourced?
Or at least could be published sample code of some transaction in C++ for Calvin to see how difficult it is to write them?

Thanks.
ReplyDelete
Replies
Chris MarisicMay 16, 2012 at 11:22 AM
Your post ignores that there is an exception to almost everything you state as "costs" to scalability.

RavenDB.

1. RavenDB is fully ACID for writes. It is BASE for reads.

2. It supports transactions spanning multiple shards.

3. It is one of the lowest cost to entry systems that is enterprise quality and durable.

4. I don't really know what this actually is about.

So to summarize, RavenDB has every single positive you list on this page, and zero negatives.

Also those performance metrics are absolutely meaningless in regards to RDBMS because to get that performance requires models that aren't used for real. Except in edge cases of extreme performance optimization. To get blazing fast RDBMS, ironically you must use it... NON-RELATIONALLY. Relational DOES NOT scale. Period.

RavenDB also supports many of the scenarios that would require joins, except it doesn't actually need to join at all.
ReplyDelete
Replies
UnknownMay 16, 2012 at 2:53 PM
How does Calvin handle correlated transactions?

ie. A transaction that writes X based on the result of a read of Y.

It seems like you get into a chicken and egg problem. You read Y (without locking it), figure out what X is to be, prep X, and then acquire locks. But between your initial read of Y, and your attempt to write X, there's nothing to prevent another user from modifying Y, in which case your transaction isn't isolated.
ReplyDelete
Replies
Prasad ChittaMay 17, 2012 at 1:35 AM
Surely a new impressive view of transaction processing scalability and replication problems.

This approach is suitable for OLTP transactions where the size of transaction is small. Also each transaction should predefined and coded on the server.

But a scalability of 500,000 txn per second on a 100 commodity nodes is surely promising along with the capability to remotely replicate.

Thanks for sharing.
ReplyDelete
Replies
UnknownMay 17, 2012 at 5:33 AM
Fantastic! This is very exciting. :) Thank you for sharing, I will be keeping my eye on this.
ReplyDelete
Replies
Harish MallipeddiMay 17, 2012 at 3:29 PM
If the replicas process the transactions at their own pace, I'd have to hit the same replica if I wanted read-after-consistency?

Maybe I'm mistaken but Calvin sounds a lot like VoltDB but extended to work with large datasets by cleverly faulting in required pages from disk into memory prior to acquiring locks. VoltDB doesn't deal with this case by just assuming everything fits in memory.
ReplyDelete
Replies
Alice GomezMay 19, 2012 at 8:34 AM
Nice work! Can I ask about the elephant in the room? Why is that the sequencer is not a bottleneck in Calvin? If all the transactions go through a single sequencer, it is certainly becomes the bottleneck. If you partition the transactions among multiple sequencers, then synchronization among the sequencers does not scale. Can you please comment on this? Thanks.
ReplyDelete
Replies
KevinMay 19, 2012 at 9:18 AM
Would you mind clarifying what you mean here?

"There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions---machine-generated transactions are a different story)."

Why are you assuming that there is no application that needs more than 500k transactions per second? A single user surely can't initiate 500k transactions a second but I think google/facebook/amazon etc. have a pretty good usecase for needing to scale past 500k transactions per second with user input.
ReplyDelete
Replies
alexMay 26, 2012 at 12:31 PM
This comment has been removed by the author.
ReplyDelete
Replies
alexMay 26, 2012 at 12:32 PM
Hi,

This looks fantastic - really great effort. I work in finance and we have some fairly difficult transactional requirements.

Your principles are quite similar to those used in some event sourcing architectures (for example the LMAX Exchange) and also a very good product by IBM called LLM (Low Latency Messaging) - specifically the RCMS component. It provides sequencing, a global ordering across a set of partitions, and assumes determinism in your transactions (but has mechanisms to support (limited numbers) of non-deterministic transactions). Importantly they use a reliable multicast messaging for replication to the tier. This may or may not be of interest to you.

I wasn't quite sure how you would move partitions in the case of failure of some part of the cluster fails?

Thanks again. Very interesting.
ReplyDelete
Replies
Masood MortazaviAugust 17, 2012 at 10:36 AM
It is not clear -- reading the blog -- how those TPC-C transactions that would affect multiple shards (presumably multiple TPC-C "warehouses", assuming you are using these as shards) are handled . . .
ReplyDelete
Replies
serendipityAugust 28, 2012 at 9:29 AM
Hi Daniel and Alexander! Very cool work! I had a question regarding the evaluation of this paper as well as your earlier paper that argues for determinism (as well as the original "The End of ... Rewrite" paper). In all these papers, you are using TPCC without the wait time and think time. Can you point me to the key reason as to why is this acceptable? Seemingly, this approach increases the concurrency in the workload making your results look better than they would with less concurrency. Furthermore, can you point me how many clients did you use per warehouse? Was it 10 clients/warehouse or are you guys reporting the max throughput obtained by varying the number of clients. Thanks a lot in advance for your reply!

Best,

Prince
ReplyDelete
Replies
S931CoderOctober 10, 2012 at 5:27 PM
I assume you are requiring solid state drives in your tests...
ReplyDelete
Replies
UnknownJune 4, 2013 at 12:41 PM
"In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute."

what if a replica failed to commit the transaction and hence rolledback the changes?
is the transaction on the (first hit server) master also rolledback?
how is the rollback possible without suffering wait-delay-for-response-from-replica?

If master does not rollback the transaction even if the transaction failed in one/more of the replicas,
then no replica is guaranteed to be consistent with the master [in effect no node is consistent with data
replicated from other nodes] and hence is not reliable
ReplyDelete
Replies
Arturo HernandezAugust 4, 2013 at 5:58 PM
I can't wait to read the documentation or the code, when it is made available. Since Alexander Thomson said that the database was placed in RAM, I think it is likely you used the High-Memory Cluster instances which run at $10,000 for a 3 year contract. For a 100 nodes the total comes up to $1,000,000. that is the ceiling and it still is 50 times less that the 50 mill for the equipment used in the winning TPC benchmark. Or did you use a micro instance. i am just curious. Comparative cost data info will eventually come out, so I may just need to wait.
ReplyDelete
Replies

Add comment

DBMS Musings

Wednesday, May 16, 2012

If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

43 comments:

Blog Archive

Followers