Wednesday, May 16, 2012

If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

(This post is coauthored by Alexander Thomson and Daniel Abadi)
In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential---some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let’s ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Go to the TPC-C website and look at the top 10 results ranked by raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all their scalability claims?

The surprising truth is that these new DBMS technologies are absent from the TPC-C top ten results not because they do not care enough to enter, but rather because they would not win if they did.

To understand why this is the case, one must understand that scalability does not come for free. Something must be sacrificed to achieve high scalability. Today, there are three major categories of tradeoff that can be exploited to make a system scale. The new technologies basically fall into two of these categories; Oracle and DB2 fall into a third. The later parts of this blog post describe research from our group at Yale that introduces a fourth category of tradeoff, one that provides a roadmap to end the dominance of Oracle and DB2.

These categories are:

(1) Sacrifice ACID for scalability. Our previous post on this topic discussed this in detail. Basically, we argue that a major class of new scalable technologies falls under the category of “NoSQL”: these systems achieve scalability by dropping ACID guarantees, which allows them to eschew two-phase locking, two-phase commit, and other impediments to concurrency and processor independence that hurt scalability. All systems that relax ACID are immediately ineligible to enter the TPC-C competition, since ACID guarantees are one of TPC-C’s requirements. That’s why you don’t see NoSQL databases in the TPC-C top 10---they are immediately disqualified.

(2) Reduce transaction flexibility for scalability. There are many so-called “NewSQL” databases that claim to be both ACID-compliant and scalable. And these claims are true---to a degree. However, the fine print is that they are only linearly scalable when transactions can be completely isolated to a single “partition” or “shard” of data. While these NewSQL databases often hide the complexity of sharding from the application developer, they still rely on the shards being fairly independent. As soon as a transaction needs to span multiple shards (e.g., update two different user records on two different shards in the same atomic transaction), these NewSQL systems all run into problems. Some simply reject such transactions. Others allow them, but must perform two-phase commit or another agreement protocol in order to ensure ACID compliance (since each shard may fail independently). Unfortunately, agreement protocols such as two-phase commit come at a great scalability cost (see our 2010 paper that explains why). Therefore, NewSQL databases only scale well if multi-shard transactions (also called “distributed transactions” or “multi-partition transactions”) are very rare. Unfortunately for these databases, TPC-C models a fairly reasonable retail application where customers buy products and the inventory needs to be updated in the same atomic transaction. 10% of TPC-C New Order transactions involve customers buying products from a “remote” warehouse, which is generally stored in a separate shard. Therefore, even for basic applications like TPC-C, NewSQL databases lose their scalability advantages. That’s why the NewSQL databases do not enter TPC-C results---even just 10% of transactions being multi-shard causes their performance to degrade rapidly.
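
To make the 10% figure concrete, here is a minimal sketch (our own illustration with assumed warehouse and shard counts, not code from any NewSQL system) of why partitioning TPC-C data by warehouse leaves a meaningful fraction of New Order transactions spanning shards. For simplicity it applies the 10% remote rate per transaction rather than per order line:

#include <cstdio>
#include <cstdlib>

const int kNumWarehouses = 100;  // assumed scale
const int kNumShards = 10;       // assumed cluster size

// One possible partitioning: warehouses spread round-robin across shards.
int ShardForWarehouse(int warehouse_id) { return warehouse_id % kNumShards; }

int main() {
  srand(42);
  const int kTrials = 100000;
  int multi_shard = 0;
  for (int i = 0; i < kTrials; ++i) {
    int home = rand() % kNumWarehouses;     // customer's home warehouse
    int supplying = home;
    if (rand() % 100 < 10)                  // ~10% of New Orders draw
      supplying = rand() % kNumWarehouses;  // stock from a remote warehouse
    // Updating the order and the remote stock atomically would force a
    // sharded NewSQL system to reject the transaction or fall back to
    // two-phase commit.
    if (ShardForWarehouse(home) != ShardForWarehouse(supplying))
      ++multi_shard;
  }
  printf("multi-shard New Orders: %.1f%%\n", 100.0 * multi_shard / kTrials);
}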

(3) Trade cost for scalability. If you use high-end hardware, it is possible to get stunningly high transactional throughput using old database technologies that lack shared-nothing horizontal scalability. Oracle tops TPC-C with an incredibly high throughput of 500,000 transactions per second. There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions---machine-generated transactions are a different story). Therefore, Oracle basically has all the scalability that is needed for human-scale applications. The only downside is cost---the Oracle system that achieves 500,000 transactions per second costs a prohibitive $30,000,000!

Since the first two types of tradeoffs are immediate disqualifiers for TPC-C, the only remaining thing to give up is cost-for-scale, and that’s why the old database technologies still dominate TPC-C---none of the new technologies can handle both ACID and the 10% of New Order transactions that are remote.

A fourth approach...

TPC-C is a very reasonable application. New technologies should be able to handle it. Therefore, at Yale we set out to find a new dimension in this tradeoff space that could allow a system to handle TPC-C at scale without costing $30,000,000. Indeed, we are presenting a paper next week at SIGMOD (see the full paper) that describes a system that can achieve 500,000 ACID-compliant TPC-C New Order transactions per second using commodity hardware in the cloud. It cost us less than $300 to run these experiments (of course, this is renting hardware rather than buying, so it’s hard to compare prices --- but still --- a factor of 100,000 less than $30,000,000 is quite large).

Calvin, our prototype system designed and built by a large team of researchers at Yale that includes Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal (in addition to the authors of this blog post), explores a tradeoff very different from the three described above. Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture. In other words, Calvin is designed to handle high-volume OLTP throughput on sharded databases running on cheap commodity hardware, hosted locally or in the cloud. Calvin significantly improves on the scalability of our previous approach to achieving determinism in database systems.

Scaling ACID

The key to Calvin’s strong performance is that it reorganizes the transaction execution pipeline normally used in DBMSs according to the principle: do all the "hard" work before acquiring locks and beginning execution. In particular, Calvin moves the following stages to the front of the pipeline:

  • Replication. In traditional systems, replicas agree on each modification to database state only after some transaction has made the change at some "master" replica. In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute.
  • Agreement between participants in distributed transactions. Database systems traditionally use two-phase commit (2PC) to handle distributed transactions. In Calvin, every node sees the same global sequence of transaction requests, and is able to use this already-agreed-upon information in place of a commit protocol.
  • Disk accesses. In our VLDB 2010 paper, we observed that deterministic systems performed terribly in disk-based environments due to holding locks for the 10ms+ duration of reading the needed data from disk, since they cannot reorder conflicting transactions on the fly. Calvin gets around this setback by prefetching into memory all records that a transaction will need during the replication phase---before locks are even acquired.

As a result, each transaction’s user-specified logic can be executed at each shard with an absolute minimum of runtime synchronization between shards or replicas to slow it down, even if the transaction’s logic requires it to access records at multiple shards. By minimizing the time that locks are held, concurrency can be greatly increased, thereby leading to near-linear scalability on a commodity cluster of machines.
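
In caricature, the reordered pipeline looks like the following single-threaded sketch (our own illustration, not Calvin's actual code): agree on a batch, prefetch, then execute, with no agreement protocol anywhere in the execution phase:

#include <map>
#include <string>
#include <vector>

struct Txn {
  std::vector<std::string> read_set;   // declared before execution
  std::vector<std::string> write_set;  // declared before execution
  int delta;                           // stand-in for the txn's logic
};

std::map<std::string, int> g_disk;   // stand-in durable storage
std::map<std::string, int> g_cache;  // records prefetched into memory

void ProcessEpoch(const std::vector<Txn>& batch) {
  // Phase 1 (elided): replicate 'batch' so that every replica holds the
  // same ordered request list. This replaces per-transaction agreement.

  // Phase 2: prefetch everything the batch will touch BEFORE any locks
  // exist, so no lock is ever held across a disk read.
  for (const Txn& t : batch) {
    for (const std::string& k : t.read_set) g_cache[k] = g_disk[k];
    for (const std::string& k : t.write_set) g_cache[k] = g_disk[k];
  }

  // Phase 3: execute strictly in the agreed order. Because the order and
  // the read/write sets are fixed, every replica serializes identically:
  // no two-phase commit, no deadlock. (A single thread serializes
  // trivially; real Calvin acquires per-record locks in sequence order
  // to get the same effect concurrently.)
  for (const Txn& t : batch)
    for (const std::string& k : t.write_set)
      g_disk[k] = g_cache[k] + t.delta;
}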

Strongly consistent global replication

Calvin’s deterministic execution semantics provide an additional benefit: replicating transactional input is sufficient to achieve strongly consistent replication. Since replicating batches of transaction requests is extremely inexpensive and happens before the transactions acquire locks and begin executing, Calvin’s transactional throughput capacity does not depend at all on its replication configuration.

In other words, not only can Calvin run 500,000 transactions per second on 100 EC2 instances in Amazon’s US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon’s Europe (Ireland) and US West (California) data centers---at no cost to throughput.

Calvin accomplishes this by having replicas perform the actual processing of transactions completely independently of one another, maintaining strong consistency without having to constantly synchronize transaction results between replicas. (Calvin’s end-to-end transaction latency does depend on message delays between replicas, of course---there is no getting around the speed of light.)
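
The underlying invariant is easy to demonstrate in toy form (our own example, with made-up transaction logic): if replicas start from the same state and apply the same ordered input deterministically, their states cannot diverge, so only the input log ever needs to cross the wire:

#include <cassert>
#include <map>
#include <string>
#include <vector>

using Db = std::map<std::string, int>;

// A deterministic transaction: same input + same state => same new state.
void Apply(Db& db, const std::string& key) { db[key] += 1; }

int main() {
  // Only this input log is replicated between data centers.
  const std::vector<std::string> input_log = {"a", "b", "a"};
  Db virginia, ireland;  // two replicas starting from identical state
  for (const std::string& k : input_log) Apply(virginia, k);
  for (const std::string& k : input_log) Apply(ireland, k);
  // Identical input, deterministic logic: identical state, with no
  // transaction results ever shipped between the replicas.
  assert(virginia == ireland);
  return 0;
}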

Flexible data model

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access). The experiments presented in the paper use a custom key-value store. More recently, we’ve hooked Calvin up to Google’s LevelDB and added support for SQL-based data access within transactions, building relational tables on top of LevelDB’s efficient sorted-string storage.
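
The storage contract Calvin needs can accordingly be quite small. A hypothetical interface of our own devising (the actual integration API may differ) would be roughly:

#include <string>

// Any engine that exposes keyed reads and writes in a way Calvin can
// observe is sufficient: Calvin schedules and replicates; the engine
// just stores. LevelDB's sorted keys additionally make range scans
// (and thus relational tables) easy to layer on top.
class StorageEngine {
 public:
  virtual ~StorageEngine() {}
  virtual std::string Get(const std::string& key) = 0;
  virtual void Put(const std::string& key, const std::string& value) = 0;
};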

From an application developer’s point of view, Calvin’s primary limitation compared to other systems is that transactions must be executed entirely server-side. Calvin has to know in advance what code will be executed for a given transaction. Users may pre-define transactions directly in C++, or submit arbitrary Python code snippets on-the-fly to be parsed and executed as transactions.

For some applications, this requirement of completely server-side transactions might be a difficult limitation. However, many applications prefer to execute transaction code on the database server anyway (in the form of stored procedures), in order to avoid multiple round trip messages between the database server and application server in the middle of a transaction.

If this limitation is acceptable, Calvin presents a nice alternative point in the tradeoff space, achieving high scalability without sacrificing ACID or multi-shard transactions. Hence, we believe that our SIGMOD paper may present a roadmap for overcoming the scalability dominance of the decades-old database solutions on traditional OLTP workloads. We look forward to debating the merits of this approach in the weeks ahead (and Alex will be presenting the paper at SIGMOD next week).

43 comments:

  1. Hi,

    Thanks for the nice write-up; it helped me confirm that I understand the Calvin paper correctly.

    Now some fearless questions:
    Will Calvin be open sourced?
    Or could you at least publish sample code for a C++ transaction in Calvin, to show how difficult they are to write?

    Thanks.

    1. Boris,

      We are indeed planning on open sourcing the Calvin codebase in the coming months. For now, though, an example C++ transaction using Calvin's default (non-SQL) data interface might look something like this:


      // Transaction state is stored in a protocol
      // buffer (TxnProto) whose fields include:
      // * 'stored_procedure_id' identifying which
      //   registered stored procedure to invoke
      // * 'read_set' listing all keys the txn is
      //   allowed to read
      // * 'write_set' listing all keys the txn is
      //   allowed to write
      // * 'args' containing any additional arguments
      //   passed by the client invoking the stored
      //   procedure
      //
      // This example transaction reads one
      // record, appends a string to the result,
      // and writes the new string out to
      // another record.
      //
      void MyTxn(TxnProto* txn) {
        // Read value(s) from database. ('storage_manager'
        // is the node-local storage interface.)
        const string& key = txn->read_set(0);
        string result = storage_manager->Get(key);

        // Do whatever computation you want here. Certain
        // nondeterministic operations are disallowed,
        // such as GetTime() and Rand(). (If these are
        // needed, they must be called in advance by
        // the client, who can then pass their results
        // to the transaction code via the 'args'
        // field.)
        string value = result.append(txn->args(0));

        // Write value(s) to database.
        const string& out_key = txn->write_set(0);
        storage_manager->Put(out_key, value);
      }


      Now, to invoke this stored procedure, the client-side code creates and sends a request:


      // Create a transaction request object.
      TxnProto txn;
      txn.set_stored_procedure_id(MY_TXN_ID);

      // Specify key(s) to read.
      txn.add_read_set("key1");

      // Specify key(s) to write.
      txn.add_write_set("key3");

      // Specify additional argument(s).
      txn.add_args("bar");

      // Send transaction to Calvin.
      db_connection->request_txn(txn);


      I hope this clarifies things a bit.

    2. How is this project progressing? Any news about open sourcing?

    3. I would happily donate some of my time to help the open source effort.

  2. Your post ignores that there is an exception to almost everything you state as a "cost" of scalability.

    RavenDB.

    1. RavenDB is fully ACID for writes. It is BASE for reads.

    2. It supports transactions spanning multiple shards.

    3. It is one of the lowest-cost-of-entry systems that is enterprise-quality and durable.

    4. I don't really know what this actually is about.

    So to summarize, RavenDB has every single positive you list on this page, and zero negatives.

    Also, those performance metrics are absolutely meaningless in regard to RDBMSs, because getting that performance requires models that aren't used in the real world, except in edge cases of extreme performance optimization. To get a blazing-fast RDBMS, ironically, you must use it... NON-RELATIONALLY. Relational DOES NOT scale. Period.

    RavenDB also supports many of the scenarios that would require joins, except it doesn't actually need to join at all.

    1. Let me make an amendment: it's BASE for QUERIES. It's ACID for WRITE and LOAD (by id).

  3. How does Calvin handle correlated transactions?

    I.e., a transaction that writes X based on the result of a read of Y.

    It seems like you get into a chicken-and-egg problem. You read Y (without locking it), figure out what X is to be, prep X, and then acquire locks. But between your initial read of Y and your attempt to write X, there's nothing to prevent another user from modifying Y, in which case your transaction isn't isolated.

    1. This is what I understand from the Calvin paper:
      You need to split the transaction into two or more transactions. The first one calculates the reads and writes needed for the second part. But the second transaction could fail (do nothing), because those expectations may have changed in the meantime.
      In that case you will need to repeat it until the expectations match.
      I think in the worst case this could prevent progress, and some kind of fairness would need to be implemented.

    2. "You need to split transaction into 2+ transactions. "

      In which case it's no longer atomic.

      Looking at the Calvin paper, I don't see how you could do this with 2 transactions. There isn't any of the book-keeping you'd need (for each X and Y you'd need some kind of version numbering ...).

      Might be there. I just don't see it.

    3. Boris, you're exactly correct.

      If a transaction's read/write set is not known in advance, we first execute a "reconnaissance" query---not really a transaction, since it doesn't update anything and can be run at low isolation---then use the results of that to start the "actual" transaction with full knowledge of its read/write set.

      Note that this is not a decomposition into two separate transactions that each perform some of the original transaction's updates. Atomicity is not threatened by this technique.

      It is indeed possible for another transaction to have updated the records on which this read/write set depends in between the recon phase and actual execution, so the system does have to check for this, and possibly restart the process. Our previous paper (http://cs.yale.edu/homes/thomson/publications/determinism-vldb10.pdf) includes a discussion of the costs associated with this (see section 4.2 and the second page of the Appendix).
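
      In toy form, the pattern looks something like this (illustrative only, with made-up key names; this is not Calvin code):

      #include <map>
      #include <string>

      std::map<std::string, std::string> g_db;  // stand-in database

      // Recon phase: cheap, low-isolation read that predicts which
      // record the transaction will need to write.
      std::string PredictTargetKey() { return g_db["index:y"]; }

      // The actual transaction, submitted with its read/write set
      // declared. It re-reads the dependency under full isolation and
      // deterministically aborts if the prediction went stale.
      bool ExecuteWithDeclaredTarget(const std::string& declared_target) {
        if (g_db["index:y"] != declared_target) return false;  // retry
        g_db[declared_target] = "updated";  // all updates happen in THIS
        return true;                        // txn; atomicity is not split
      }

      void RunDependentTxn() {
        while (true) {
          std::string target = PredictTargetKey();       // recon
          if (ExecuteWithDeclaredTarget(target)) break;  // or restart
        }
      }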

    4. Nonatomic transactions, aka useless transactions.

      There is absolutely no way you can protect against every edge case (power outages, network drops, packet loss, disk crashes, overloaded disks, etc.) such that your multiple transactions can never get into a state where an operation is missed, or thought completed but actually failed.

    5. So if what you're doing is a transactional, distributed hash table with two operations, insert() and read(), Calvin's the way to go.

      Fair enough.

    6. On the contrary, there already exist lots of scalable DHTs that provide ACID transactions on single-row reads and inserts (e.g. BigTable/HBase).

      The goal of Calvin is to support arbitrary read-modify-write transactions spanning any set of data---not just ACID inserts and reads.

    7. Now I'm confused.

      I thought you just explained that Calvin cannot support atomic update ops. I thought this is what you meant when you wrote "It is indeed possible for another transaction to have updated the records on which this read/write set depends in between the recon phase and actual execution, so the system does have to check for this, and possibly restart the process." To implement an update, you need to break the op into two phases: a pre-compute phase, and then an apply phase.

      Which makes the classic "UPDATE ACCOUNT SET Value = Value * 1.001 ..." query kinda awkward to implement.

      I'll clarify my thinking: Calvin seems to be a step forward in the world of distributed hash table transactions, and the "compute before you commit" idea is kinda neat. But we're still a long way away from implementing banking apps (which might not be a big problem - let a thousand flowers bloom, etc.).

    8. The whole point of Calvin is to handle those arbitrary update transactions that involve multiple nodes in the cluster. Perhaps looking at Section 4.2 of the older VLDB 2010 paper will resolve your confusion: http://cs.yale.edu/homes/thomson/publications/determinism-vldb10.pdf

    9. Yeah. And in Section 4.2 of that also rather good paper, I read this:

      "U2 has some information about what it probably has to lock and immediately locks these items. It then checks if it locked the correct items (i.e., none of the transactions that ran in the meantime changed the dependency)."

      I look at "some information" and "probably has to lock" and I smell something complex. From my perspective, the only way U2 can "check if it locked the correct items" is to re-run U1. So in this case, the approach doubles the cost of the transaction: it must run it twice. Your paper gets it exactly right when it says "This method works on a principle similar to that of optimistic concurrency control, and as in OCC, decomposed dependent transactions run the risk of starvation should their dependencies often be updated between executions of the decomposed parts."

      Which might not be such a big problem in practice, although it's worth noting that OCC didn't "win". Update operations of this kind aren't especially common in the field, and doubling Tx cost on unusual transactions might be an entirely reasonable price to pay given the advantages Calvin has in more common ops.

      But it's important to understand the trade-offs.

  4. Surely an impressive new view of transaction processing scalability and replication problems.

    This approach is suitable for OLTP workloads where each transaction is small. Also, each transaction should be predefined and coded on the server.

    But scalability of 500,000 txns per second on 100 commodity nodes is surely promising, along with the capability to replicate remotely.

    Thanks for sharing.

  5. Fantastic! This is very exciting. :) Thank you for sharing, I will be keeping my eye on this.

  6. If the replicas process transactions at their own pace, I'd have to hit the same replica if I wanted read-after-write consistency?

    Maybe I'm mistaken, but Calvin sounds a lot like VoltDB extended to work with large datasets by cleverly faulting required pages from disk into memory prior to acquiring locks. VoltDB doesn't deal with this case; it just assumes everything fits in memory.

    1. You're right that Calvin doesn't automatically promise read-after-write consistency. However, each committed transaction returns to the client its logical timestamp. This timestamp is actually the transaction's place in the serial transaction ordering to which all replicas' executions must be equivalent. A client could therefore specify that a read must see a version at least as new as the timestamps of its previous writes. It would be straightforward to augment Calvin's client-side interface to track the history of the client's database interactions and automatically add these annotations to read requests, so that they could safely be sent to any replica.
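
      For instance, a hypothetical client-side wrapper (not Calvin's actual interface) might track this as follows:

      #include <algorithm>
      #include <cstdint>

      class CalvinClient {
       public:
        // Record the logical timestamp (position in the global serial
        // order) returned by each committed transaction.
        void OnCommit(uint64_t logical_timestamp) {
          last_seen_ = std::max(last_seen_, logical_timestamp);
        }
        // Attach this to a read sent to ANY replica: the replica serves
        // the read only once it has executed past this point.
        uint64_t MinVersionForRead() const { return last_seen_; }
       private:
        uint64_t last_seen_ = 0;
      };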

      And yes, Calvin and VoltDB do have a number of things in common. But here are a couple more differences:

      1) Although VoltDB's serial execution scheme does generally yield equivalence to a predefined serial order, I don't believe the system implements an explicit determinism invariant. VoltDB can therefore abort inconvenient transactions more easily than Calvin can, but at the cost of needing a distributed commit protocol for multi-shard transactions.

      2) VoltDB does not implement any row-level locking scheme, so each partition's execution thread blocks completely on any remote read, whereas Calvin tries to maintain high concurrency at each shard, even in the presence of lots of multi-shard transactions.

      Essentially, VoltDB is highly optimized for embarrassingly partitionable applications, and is extremely successful in that space. Calvin is designed to handle applications that are NOT perfectly partitionable.

    2. Alex,

      Thanks for taking care of a lot of these comments. I would add that a major difference between VoltDB and Calvin is that VoltDB uses 2PC for distributed transactions and Calvin does not.

      On the other hand, I do think they have an explicit determinism invariant --- I don't see how their command logging implementation would work without it.

      The key to note is that when VoltDB runs TPC-C, they disable the 10% remote NewOrder transactions (just like the rest of the NewSQL crowd) --- see http://community.voltdb.com/node/134 --- here's a quote from that page: "The VoltDB benchmark differs from the official tpc-c benchmark in two significant ways. Operationally, the VoltDB benchmark does not include any wait times, which we feel are no longer relevant. It also does not include fulfillment of orders submitted to one warehouse, with items from another warehouse (approximately 10% of the new order transactions in the official benchmark)."

  7. Nice work! Can I ask about the elephant in the room? Why is the sequencer not a bottleneck in Calvin? If all the transactions go through a single sequencer, it certainly becomes the bottleneck. If you partition the transactions among multiple sequencers, then synchronization among the sequencers does not scale. Can you please comment on this? Thanks.

    1. Thanks for the great question, Alice! This was definitely one of the tricky parts of getting Calvin to scale gracefully.

      We do partition the transactions across multiple sequencers---in fact, every storage node in the deployment also holds a sequencer node. Each sequencer node independently generates and publishes ordered transaction batches for every 10ms epoch. At each epoch, each scheduler node collects every sequencer's batch for that epoch (actually only the relevant 'sub-batch' of transactions that need to execute at that scheduler), then interleaves them in a deterministic manner to get a view of the total transaction request sequence. So sequencers don't synchronize with one another (at least, not within the same replica), but each one interacts with all scheduler nodes.

      This seems at first like it wouldn't scale, but we actually found that it does. Here's why: When a sequencer is done compiling a batch of transaction requests for a given 10ms epoch, it doesn't send the whole batch out to every single scheduler. It only sends to each scheduler the small sub-batch of transactions that are relevant to that scheduler. As the total number of nodes grows, the size of each sub-batch shrinks (assuming the number of nodes a transaction is expected to touch remains constant). So the total data volume of sub-batches sent out by sequencers stays roughly constant. Similarly, each scheduler accepts sub-batches from increasingly many sequencers, but as more nodes are added (and as hot records get spread out to rebalance load), the total number of requests in all sub-batches for a given epoch doesn't change.

      (Before we realized this, I thought this would for sure be a bottleneck, so I looked into using a scalable publish-subscribe service that implemented lots of fanout, but it turned out we never needed it, at least not for ~100 node Calvin instances.)
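
      Here's a back-of-envelope version of that argument, with made-up constants for per-node load and for the number of shards each transaction touches:

      #include <cstdio>

      int main() {
        const double kTxnsPerSequencerPerEpoch = 50;  // assumed load/node
        const double kShardsPerTxn = 2;               // assumed constant
        for (int n = 10; n <= 1000; n *= 10) {
          // Each sequencer sends each request only to the schedulers
          // that the request actually involves.
          double sent = kTxnsPerSequencerPerEpoch * kShardsPerTxn;
          // Each scheduler hears from all n sequencers, but each
          // sub-batch contains ~1/n of that sequencer's requests.
          double received = n * (sent / n);
          printf("n=%4d  sent/sequencer=%.0f  received/scheduler=%.0f\n",
                 n, sent, received);
        }
        return 0;
      }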

    3. Thanks for the detailed response. In this design, is not each sequencer a single point of failure?

      > each scheduler node collects every sequencer's batch for that epoch ...

      If a sequencer does not send its batches (sub-batches) in a timely manner, no scheduler can progress, right?

    4. Each sequencer is replicated using Paxos. Please see the full paper for more details.

    5. The failover time in practical deployments of Paxos in a WAN could be as high as 10s of seconds (it is certainly not instantaneous). The more potential single points of failure in a system (here, the sequencers), the higher the chance of unavailability due to a failover delay. Anyway, 500K TPS over a WAN is very impressive and a compromise in latency/availability is surely acceptable.

    6. You're right that WAN failover time can be high (we use local failure detectors within each datacenter to reduce this time, so we expect failover to take on the order of a second rather than 10s of seconds---this feature is under development and still being tested, but feel free to ping me in a month or two if you'd like to hear the results of our experiments on this).

      But you're right that it is possible for an entire database replica to get blocked for that duration if a sequencer node fails. Note, however, that only that one replica experiences any hiccup---the others continue on seamlessly.

  8. Would you mind clarifying what you mean here?

    "There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions---machine-generated transactions are a different story)."

    Why are you assuming that there is no application that needs more than 500k transactions per second? A single user surely can't initiate 500k transactions a second, but I think google/facebook/amazon etc. have a pretty good use case for needing to scale past 500k transactions per second with user input.

    1. You're quite right. MMOs keeping track of game state for thousands of concurrent users and millions of in-game objects are another example. And besides, more and more systems DO have to process high volumes of machine-generated transactions, particularly in the finance realm.

      The point we were trying to make here is that until recently, that kind of transactional throughput requirement was rare, so scale-up systems using traditional database technologies sufficed for enterprises, despite high cost-per-performance.

      That more applications are now emerging with colossal throughput requirements reinforces our view that the need for a cost-effective, scale-out approach to general-purpose transaction processing is becoming more urgent.

  10. Hi,

    This looks fantastic - really great effort. I work in finance and we have some fairly difficult transactional requirements.

    Your principles are quite similar to those used in some event sourcing architectures (for example the LMAX Exchange) and also a very good product by IBM called LLM (Low Latency Messaging) - specifically the RCMS component. It provides sequencing, a global ordering across a set of partitions, and assumes determinism in your transactions (but has mechanisms to support limited numbers of non-deterministic transactions). Importantly, they use reliable multicast messaging for replication to the tier. This may or may not be of interest to you.

    I wasn't quite sure how you would move partitions in the case where some part of the cluster fails?

    Thanks again. Very interesting.

    1. Alex,

      That's really interesting about LLM---I wasn't familiar with that product and I'll definitely look into the similarities and differences. Thanks for bringing it to my attention!

      As for Calvin's failure handling, there are really two types of failure modes:

      1) Simple failures, in which one or more machines fail, but a quorum of replicas of the failed machine(s) remain active. In this case, a node containing the same partition but in a different replica takes over serving the afflicted node's outbound traffic (remote read results, transaction results, etc.). We are currently working on integrating a low-latency failure detector into Calvin to make this fail-over relatively seamless.

      2) Quorum loss failures, in which multiple replicas of the same partition fail, so that the partition can no longer achieve progress in the Paxos-based replication of input batches. This is a MUCH more complicated failure case, and there are a number of possible fail-over behaviors that would make sense to implement, all of which appear to be expensive. In this case, a noticeable hiccup in latency/availability may be unavoidable, although we're still examining approaches to this problem. Note also that this should be an extremely rare phenomenon, especially when the system is replicated across multiple data centers.

  11. It is not clear -- reading the blog -- how those TPC-C transactions that would affect multiple shards (presumably multiple TPC-C "warehouses", assuming you are using these as shards) are handled...

  12. Hi Daniel and Alexander! Very cool work! I had a question regarding the evaluation of this paper as well as your earlier paper that argues for determinism (as well as the original "The End of ... Rewrite" paper). In all these papers, you are using TPC-C without the wait time and think time. Can you point me to the key reason why this is acceptable? Seemingly, this approach increases the concurrency in the workload, making your results look better than they would with less concurrency. Furthermore, can you tell me how many clients you used per warehouse? Was it 10 clients per warehouse, or are you reporting the max throughput obtained by varying the number of clients? Thanks a lot in advance for your reply!

    Best,

    Prince

    1. Prince---
      Rather than implementing a fixed number of discrete clients for our TPC-C experiments, we implemented a distributed service that hammered the Calvin deployment with more New Order transaction requests than it could possibly execute. This simulated a front-end load balancer and an unbounded number of clients. Calvin then internally throttled the request load to the number of transaction requests per second that it could actually handle, leaving the rest on the incoming-transaction queue. We did this to make sure not to artificially limit throughput by underloading the system.

      Although this is a minor deviation from the TPC-C specification, it remains in the spirit of the benchmark. After all, a real-world TPC-C-esque ecommerce system would have a load balancer, and most New Order requests would come from many different clients at random intervals (rather than repeatedly from the same small set of clients with predictable wait/keying times between each request).

      The result of this change is actually that our workload experiences slightly higher contention than standard TPC-C (i.e. it is generally HARDER to achieve high throughput and scale to many machines under these conditions). We think this is a reasonable modification since there do exist real applications which experience very, very high contention.

      Examples:

      - Finance applications, particularly high-frequency trading. Prices and quantities of a very finite number of stocks are updated in a continuous stream as a result of zillions of trades per day.

      - Multiplayer games. Each player interacts with other players and environment objects in real time.

      Anyway, there already exist (a) systems that can handle high contention without distributed transactions (e.g. HStore/VoltDB), and (b) systems that handle low-contention distributed transactions (e.g. anything based on the System R* 2PC model). However, there are no commercial systems that can do both at the same time---high contention AND distributed transactions. There is a real demand for systems that are flexible enough to handle either or both of these challenges, and this is the niche that we're targeting with the Calvin research project. (Note, for example, that financial transactions often involve two or more parties. It would be very natural for the records representing these entities (which are therefore updated in the transactions) to be sharded across different servers. To my knowledge, the systems that track major stock exchanges do NOT do this. They use scale-up rather than scale-out systems, precisely because there is no commercial solution to the problem of high-contention distributed transactions.)

      I hope this helps!
      Alex

  13. I assume you are requiring solid state drives in your tests...

    1. S931Coder---

      Many of the performance measurements in the paper store all data in memory. All logical logging of incoming transactions uses standard rotating disks. Section 4 discusses Calvin's performance characteristics when not all data fits in memory, and considers storing data on standard rotating disks. We have not published any performance numbers involving solid state drives.

      Alex

  14. "In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute."


    What if a replica failed to commit the transaction and hence rolled back the changes?
    Is the transaction on the (first-hit server) master also rolled back?
    How is the rollback possible without suffering a wait-delay-for-response-from-replica?

    If the master does not roll back the transaction even if the transaction failed in one or more of the replicas,
    then no replica is guaranteed to be consistent with the master [in effect, no node is consistent with data
    replicated from other nodes], and hence the system is not reliable.

    1. Let's compare this to the industry-standard 2PC. In a two-phase commit, first all parties (servers) prepare the transaction. Then the broker notifies the parties to commit the transaction. Each party then writes the changes permanently. The theory is that >90% of the risky work already took place in the prepare phase. But there can be a problem in the commit phase, and there is no way to recover within the 2PC protocol.

      CalvinDB is just changing the order of operations. A simple example is a foreign key check. FKs are checked during insert, so there is no way to know whether an insert will succeed unless the insert is done. But it does not need to be that way: we can separate the work such that we check the FK first and then do the insert. If you look at most transactions, there is a lot of checking code intermingled with the writing code; that is a consequence of an imperative programming style vs a functional or declarative one. That is why the writers of 2PC chose "Prepare" to mean virtually executing the transaction. I personally think that this approach deserves a new term; I like Async ACID, where the transaction gets validated in the first phase and written in the second phase.
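
      A toy version of that foreign key example (my own sketch, nothing from Calvin):

      #include <map>
      #include <string>

      std::map<int, std::string> g_parents;        // parent table
      std::multimap<int, std::string> g_children;  // child rows, FK -> parent

      // Phase 1: pure validation, no writes.
      bool FkCheck(int parent_id) { return g_parents.count(parent_id) > 0; }

      // Phase 2: the write, which cannot fail if phase 1 passed and the
      // parent cannot vanish in between (which a predetermined serial
      // order guarantees).
      void InsertChild(int parent_id, const std::string& row) {
        g_children.emplace(parent_id, row);
      }

      bool InsertWithFk(int parent_id, const std::string& row) {
        if (!FkCheck(parent_id)) return false;  // deterministic rejection
        InsertChild(parent_id, row);
        return true;
      }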

  15. I can't wait to read the documentation or the code when it is made available. Since Alexander Thomson said that the database was placed in RAM, I think it is likely you used the High-Memory Cluster instances, which run at $10,000 for a 3-year contract. For 100 nodes the total comes to $1,000,000; that is the ceiling, and it is still 50 times less than the $50 million for the equipment used in the winning TPC benchmark. Or did you use micro instances? I am just curious. Comparative cost data will eventually come out, so I may just need to wait.
