Compact Event Graph Encoding

distributed-systemsdata-structuresoptimization
size(G)=O(Eavg_run1)\text{size}(G) = O(|E| \cdot \text{avg\_run}^{-1})

Statement

The eg-walker stores the event graph in a column-oriented, run-length encoded format that exploits the sequential nature of typical text editing. Events are topologically sorted and grouped into runs of consecutive operations from the same replica.

Encoding Scheme

Events are decomposed into columns, each compressed independently:

ColumnContentCompression
Operation type + start positionIns/Del + indexRun-length: consecutive inserts at adjacent positions collapse
Inserted contentUTF-8 charactersLZ4 compression of concatenated text
ParentsCausal predecessor IDsDefault = predecessor in sort; only exceptions stored
Event IDs(replica, seq) pairsRuns of consecutive sequence numbers from same replica

Visualization

A typical editing session with three users:

graph TD
    subgraph "Raw Events (24 events)"
        R["e₁: Ins(0,'H'), e₂: Ins(1,'e'), e₃: Ins(2,'l'), e₄: Ins(3,'l'), e₅: Ins(4,'o'), ..."]
    end

    subgraph "Run-Length Encoded (3 runs)"
        C1["Run 1: Alice inserts 'Hello' at pos 0..4"]
        C2["Run 2: Bob inserts ' World' at pos 5..10"]
        C3["Run 3: Carol deletes pos 11..13"]
    end

    R --> |"column encoding"| C1
    R --> C2
    R --> C3

Parent Encoding

The key insight for parent compression: in a topological sort, most events' parent is simply the immediately preceding event. Only branch points and merge points need explicit parent storage.

Parent(ei)={ei1(default — stored implicitly)explicit IDs(branch/merge points only)\text{Parent}(e_i) = \begin{cases} e_{i-1} & \text{(default — stored implicitly)} \\ \text{explicit IDs} & \text{(branch/merge points only)} \end{cases}

For a document with nn events and bb branch/merge points, parent storage is O(b)O(b) rather than O(n)O(n).

Space Complexity

ScenarioEventsCompressed sizeRatio
Single user, sequential typingnnO(n/avg_run_length)O(n / \text{avg\_run\_length})~10–50× reduction
Real-time collaboration (2–5 users)nnO(nc)O(n \cdot c) where c1c \ll 1~5–20× reduction
Worst case (random interleaving)nnO(n)O(n)No compression benefit

Comparison with CRDT Storage

Traditional CRDTs (Yjs, Automerge) store per-character metadata in the document state:

CRDT memory=D(ID size+tombstone overhead+ordering metadata)\text{CRDT memory} = |D| \cdot (\text{ID size} + \text{tombstone overhead} + \text{ordering metadata})

Eg-walker stores metadata in the event history and keeps only the document string in steady state:

Eg-walker memory=D+event log (compressed)\text{Eg-walker memory} = |D| + |\text{event log (compressed)}|

The event log can be stored on disk and only the portion after the last critical version needs to be in memory for processing new events.

Connections

The compact encoding makes the Eg-walker: Event Graph WalkerEg-walker: Event Graph Walkerdoc=replay(G)where G=(E,)\text{doc} = \text{replay}(G) \quad \text{where } G = (E, \rightarrow)A collaboration algorithm for text editing that combines CRDT convergence with OT-competitive memory and merge performance.Read more → practical for real-world use. Critical Versions and Partial ReplayCritical Versions and Partial Replaye1G1,  e2G2:e1e2\forall e_1 \in G_1,\; \forall e_2 \in G_2: \quad e_1 \rightarrow e_2How critical versions partition the event graph into independently processable sections, enabling garbage collection and efficient state reconstruction.Read more → enables discarding old portions of the encoded history. The encoding is orthogonal to the Internal State MachineInternal State Machinesp{NotInsertedYet,Ins,Del1,Del2,}s_p \in \{\text{NotInsertedYet}, \text{Ins}, \text{Del}_1, \text{Del}_2, \ldots\}The per-character state machine that tracks insertion/deletion across prepare and effect versions in eg-walker.Read more → — it affects storage, not processing semantics.