Concurrent updates to multiple replicas of the same data, without coordination between the computers hosting the replicas, can result in inconsistencies between the replicas, which in the general case may not be resolvable. Restoring consistency and data integrity when there are conflicts between updates may require some or all of the updates to be entirely or partially dropped.
Accordingly, much of distributed computing focuses on the problem of how to prevent concurrent updates to replicated data. But another possible approach is optimistic replication, where all concurrent updates are allowed to go through, with inconsistencies possibly created, and the results are merged or "resolved" later. In this approach, consistency between the replicas is eventually re-established via "merges" of differing replicas. While optimistic replication might not work in the general case, there is a significant and practically useful class of data structures, CRDTs, where it does work — where it is always possible to merge or resolve concurrent updates on different replicas of the data structure without conflicts. This makes CRDTs ideal for optimistic replication.
As an example, a one-way Boolean event flag is a trivial CRDT: one bit, with a value of true or false. True means some particular event has occurred at least once. False means the event has not occurred. Once set to true, the flag cannot be set back to false (an event having occurred cannot un-occur). The resolution method is "true wins": when merging a replica where the flag is true (that replica has observed the event), and another one where the flag is false (that replica hasn't observed the event), the resolved result is true — the event has been observed.
State-based CRDTs (also called convergent replicated data types, or CvRDTs) are defined by two types, a type for local states and a type for actions on the state, together with three functions: A function to produce an initial state, a merge function of states, and a function to apply an action to update a state. State-based CRDTs simply send their full local state to other replicas on every update, where the received new states is then merged into the local state. To ensure eventual convergence the functions should fulfill the following properties:
The merge function should compute the join for any pair of replica states, and should form a semilattice with the initial state as the neutral element. In particular this means, that the merge function must be commutative, associative, and idempotent. The intuition behind commutativity, associativity and idempotence is that these properties are used to make the CRDT invariant under package re-ordering and duplication. Furthermore, the update function must be monotone with regard to the partial order defined by said semilattice.
Delta state CRDTs[11][14] (or simply Delta CRDTs) are optimized state-based CRDTs where only recently applied changes to a state are disseminated instead of the entire state.
Operation-based CRDTs
Operation-based CRDTs (also called commutative replicated data types, or CmRDTs) are defined without a merge function. Instead of transmitting states, the update actions are transmitted directly to replicas and applied. For example, an operation-based CRDT of a single integer might broadcast the operations (+10) or (−20). The application of operations should still be commutative and associative. However, instead of requiring that application of operations is idempotent, stronger assumptions on the communications infrastructure are expected -- all operations must be delivered to the other replicas without duplication.
Pure operation-based CRDTs[13] are a variant of operation-based CRDTs that reduces the metadata size.
Comparison
The two alternatives are theoretically equivalent, as each can emulate the other.[1]
However, there are practical differences.
State-based CRDTs are often simpler to design and to implement; their only requirement from the communication substrate is some kind of gossip protocol.
Their drawback is that the entire state of every CRDT must be transmitted eventually to every other replica, which may be costly.
In contrast, operation-based CRDTs transmit only the update operations, which are typically small.
However, operation-based CRDTs require guarantees from the communication middleware; that the operations are not dropped or duplicated when transmitted to the other replicas, and that they are delivered in causal order.[1]
While operations-based CRDTs place more requirements on the protocol for transmitting operations between replicas, they use less bandwidth than state-based CRDTs when the number of transactions is small in comparison to the size of internal state. However, since the state-based CRDT merge function is associative, merging with the state of some replica yields all previous updates to that replica. Gossip protocols work well for propagating state-based CRDT state to other replicas while reducing network use and handling topology changes.
Some lower bounds[15] on the storage complexity of state-based CRDTs are known.
This state-based CRDT implements a counter for a cluster of n nodes. Each node in the cluster is assigned an ID from 0 to n - 1, which is retrieved with a call to myId(). Thus each node is assigned its own slot in the array P, which it increments locally. Updates are propagated in the background, and merged by taking the max() of every element in P. The compare function is included to illustrate a partial order on the states. The merge function is commutative, associative, and idempotent. The update function monotonically increases the internal state according to the compare function. This is thus a correctly defined state-based CRDT and will provide strong eventual consistency. The operations-based CRDT equivalent broadcasts increment operations as they are received.[2]
A common strategy in CRDT development is to combine multiple CRDTs to make a more complex CRDT. In this case, two G-Counters are combined to create a data type supporting both increment and decrement operations. The "P" G-Counter counts increments; and the "N" G-Counter counts decrements. The value of the PN-Counter is the value of the P counter minus the value of the N counter. Merge is handled by letting the merged P counter be the merge of the two P G-Counters, and similarly for N counters. Note that the CRDT's internal state must increase monotonically, even though its external state as exposed through query can return to previous values.[2]
Two G-Sets (grow-only sets) are combined to create the 2P-set. With the addition of a remove set (called the "tombstone" set), elements can be added and also removed. Once removed, an element cannot be re-added; that is, once an element e is in the tombstone set, query will never again return True for that element. The 2P-set uses "remove-wins" semantics, so remove(e) takes precedence over add(e).[2]
LWW-Element-Set (Last-Write-Wins-Element-Set)
LWW-Element-Set is similar to 2P-Set in that it consists of an "add set" and a "remove set", with a timestamp for each element. Elements are added to an LWW-Element-Set by inserting the element into the add set, with a timestamp. Elements are removed from the LWW-Element-Set by being added to the remove set, again with a timestamp. An element is a member of the LWW-Element-Set if it is in the add set, and either not in the remove set, or in the remove set but with an earlier timestamp than the latest timestamp in the add set. Merging two replicas of the LWW-Element-Set consists of taking the union of the add sets and the union of the remove sets. When timestamps are equal, the "bias" of the LWW-Element-Set comes into play. A LWW-Element-Set can be biased towards adds or removals. The advantage of LWW-Element-Set over 2P-Set is that, unlike 2P-Set, LWW-Element-Set allows an element to be reinserted after having been removed.[2]
OR-Set (Observed-Remove Set)
OR-Set resembles LWW-Element-Set, but using unique tags instead of timestamps. For each element in the set, a list of add-tags and a list of remove-tags are maintained. An element is inserted into the OR-Set by having a new unique tag generated and added to the add-tag list for the element. Elements are removed from the OR-Set by having all the tags in the element's add-tag list added to the element's remove-tag (tombstone) list. To merge two OR-Sets, for each element, let its add-tag list be the union of the two add-tag lists, and likewise for the two remove-tag lists. An element is a member of the set if and only if the add-tag list less the remove-tag list is nonempty.[2] An optimization that eliminates the need for maintaining a tombstone set is possible; this avoids the potentially unbounded growth of the tombstone set. The optimization is achieved by maintaining a vector of timestamps for each replica.[16]
Some known Sequence CRDTs are Treedoc,[5]
RGA,[17] Woot,[4]
Logoot,[18] and LSEQ.[19]
CRATE[20] is a decentralized real-time editor built on top of LSEQSplit (an extension of LSEQ) and runnable on a network of browsers using WebRTC.
LogootSplit [21] was proposed as an extension of Logoot in order to reduce the metadata for sequence CRDTs. MUTE [22][23] is an online web-based peer-to-peer real-time collaborative editor relying on the LogootSplit algorithm.
Industrial sequence CRDTs, including open-source ones, are known to out-perform academic implementations due to optimizations and a more realistic testing methodology.[24] The main popular example is Yjs CRDT, a pioneer in using a plainlist instead of a tree (ala Kleppmann's automerge).[25]
Industry use
Zed is an open-source collaborative IDE built by zed-industries that uses CRDTs in its "buffers" to handle conflicts when doing collaborative edits on the same file.
Fluid Framework is an open-source collaborative platform built by Microsoft that provides both server reference implementations and client-side SDKs for creating modern real-time web applications using CRDTs.
Nimbus Note is a collaborative note-taking application that uses the Yjs CRDT for collaborative editing.[26]
Redis is a distributed, highly available, and scalable in-memory database with a "CRDT-enabled database" feature.[27]
SoundCloud open-sourced Roshi, a LWW-element-set CRDT for the SoundCloud stream implemented on top of Redis.[28]
Riak is a distributed NoSQL key-value data store based on CRDTs.[29]League of Legends uses the Riak CRDT implementation for its in-game chat system, which handles 7.5 million concurrent users and 11,000 messages per second.[30]
Bet365 stores hundreds of megabytes of data in the Riak implementation of OR-Set.[31]
TomTom employs CRDTs to synchronize navigation data between the devices of a user.[32]
Phoenix, a web framework written in Elixir, uses CRDTs to support real-time multi-node information sharing in version 1.2.[33]
Facebook implements CRDTs in their Apollo low-latency "consistency at scale" database.[34]
Facebook uses CRDTs in their FlightTracker system for managing the Facebook graph internally.[35]
Teletype for Atom employs CRDTs to enable developers share their workspace with team members and collaborate on code in real time.[36]
Haja Networks' OrbitDB uses operation-based CRDTs in its core data structure, IPFS-Log.[37]
Apple implements CRDTs in the Notes app for syncing offline edits between multiple devices.[38]
Swim is a platform for running distributed real-time streaming applications that deliver continuous intelligence. It uses streaming actors that stream pure op-based CRDT state updates to other actors in a DAG that implements a streaming data pipeline.
RxDB is a client-side NoSQL database for distributed real-time streaming applications. It has a CRDT plugin that enables updating a document by storing NoSQL based CRDT deltas and replicating these with other clients or a backend server.
PGD is a multi-master replication solution based on PostgreSQL. It supports CRDT column types.
Novell, Inc. introduced a state-based CRDT with "loosely consistent" directory replication (NetWare Directory Services), included in NetWare 4.0 in 1995.[39] The successor product, eDirectory, delivered improvements to the replication process.[40]
^ abcdefg
Shapiro, Marc; Preguiça, Nuno; Baquero, Carlos; Zawirski, Marek (13 January 2011). "A Comprehensive Study of Convergent and Commutative Replicated Data Types". Rr-7506.
^
Shapiro, Marc; Preguiça, Nuno (2007). "Designing a Commutative Replicated Data Type". arXiv:0710.1784 [cs.DC].
^ ab
Letia, Mihai; Preguiça, Nuno; Shapiro, Marc (2009). "CRDTs: Consistency without Concurrency Control". Computing Research Repository. arXiv:0907.0929.
^
Baquero, Carlos; Moura, Francisco (1 October 1999). "Using Structural Characteristics for Autonomous Operation". SIGOPS Oper. Syst. Rev. 33 (4): 90–96. doi:10.1145/334598.334614. hdl:1822/34984. S2CID13882850.
^ ab
Almeida, Paulo Sérgio; Shoker, Ali; Baquero, Carlos (2015-05-13). "Efficient State-Based CRDTS by Delta-Mutation". In Bouajjani, Ahmed; Fauconnier, Hugues (eds.). Networked Systems. Lecture Notes in Computer Science. Vol. 9466. Springer International Publishing. pp. 62–76. arXiv:1410.2803. doi:10.1007/978-3-319-26850-7_5. ISBN9783319268491. S2CID7852769.
^ ab
Baquero, Carlos; Almeida, Paulo Sérgio; Shoker, Ali (2014-06-03). "Making Operation-Based CRDTS Operation-Based". In Magoutis, Kostas; Pietzuch, Peter (eds.). Distributed Applications and Interoperable Systems. Lecture Notes in Computer Science. Vol. 8460. Springer Berlin Heidelberg. pp. 126–140. CiteSeerX10.1.1.492.8742. doi:10.1007/978-3-662-43352-2_11. ISBN9783662433515.
^
Almeida, Paulo Sérgio; Shoker, Ali; Baquero, Carlos (2016-03-04). "Delta State Replicated Data Types". Journal of Parallel and Distributed Computing. 111: 162–173. arXiv:1603.01529. doi:10.1016/j.jpdc.2017.08.003. S2CID7990602.
^Roh, Huyn-Gul; Jeon, Myeongjae; Kim, Jin-Soo; Lee, Joonwon (2011). "Replicated Abstract Data Types: Building Blocks for Collaborative Applications". Journal of Parallel and Distributed Computing. 71 (2): 354–368. doi:10.1016/j.jpdc.2010.12.006.
^
Weiss, Stephane; Urso, Pascal; Molli, Pascal (2010). "Logoot-Undo: Distributed Collaborative Editing System on P2P Networks". IEEE Transactions on Parallel and Distributed Systems. 21 (8): 1162–1174. doi:10.1109/TPDS.2009.173. ISSN1045-9219. S2CID14172605.
^
André, Luc; Martin, Stéphane; Oster, Gérald; Ignat, Claudia-Lavinia (2013). "Supporting Adaptable Granularity of Changes for Massive-scale Collaborative Editing". Proceedings of the International Conference on Collaborative Computing: Networking, Applications and Worksharing – CollaborateCom 2013. pp. 50–59. doi:10.4108/icst.collaboratecom.2013.254123. ISBN978-1-936968-92-3.