📅 Original date posted:2014-01-05
📝 Original message:Hello and happy new year to this mailing list!
Thank you Mark for the incredible work you've been doing on this.
I am following this very closely, because it is of primary importance
for Electrum.
I have written a Python-levelDB implementation of this UTXO hashtree,
which is currently being tested, and will be added to Electrum servers.
My implementation follows Alan Reiner's idea to store the tree as items
in a key-value database. I believe that a C++ implementation like yours
will be at least an order of magnitude faster, and I am looking forward
to it.
I too believe that BIPs should define interoperability points, but probably
not implementation details. For the UTXO hashtree, this means that a BIP
should at least specify how the root hash is constructed. This might be the
only thing that needs to be specified.
However, I see no pressing issue with writing a BIP; it might be preferable
to implement and test different options first, and learn from that.
Thomas
Le 20/12/2013 02:47, Mark Friedenbach a écrit :
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello fellow bitcoin developers. Included below is the first draft of
> a BIP for a new Merkle-compressed data structure. The need for this
> data structure arose out of the misnamed "Ultimate blockchain
> compression" project, but it has since been recognized to have many
> other applications.
>
> In addition to this BIP I am preparing three additional BIPs
> describing the use of this data structure in stateless validation &
> mining, the UBC address index for "SPV+" operating modes, document
> timestamping and merged mining.
>
> A Python implementation of this data structure is available here:
>
> https://github.com/monetizeio/python-bitcoin
>
> A C++ implementation is being worked on.
>
> As per the BIP-1 procedure, I am submitting this rough draft to the
> community for discussion. I welcome all comments and criticisms of
> both form and content.
>
> - -Mark
>
>
> ==Abstract==
>
> This BIP describes a [http://en.wikipedia.org/wiki/Hash_tree Merkle
> hash tree] variant of the [http://en.wikipedia.org/wiki/Trie
> prefix-tree data structure], ideally suited for encoding key-value
> indices which support memory-efficient proofs.
>
> ==Motivation==
>
> There are a number of applications which would benefit from having a
> data structure with the following properties:
>
> * '''Arbitrary mapping of keys to values.''' A ''key'' can be any
> bytestring, and its ''value'' any other bytestring.
> * '''Duplicate keys disallowed.''' Every key has one, and only one
> value associated with it. Some applications demand assurance that no
> key value is reused, and that this constraint can be checked without
> requiring access to the entire data structure.
> * '''Efficient look-up by key.''' The data structure should support
> sub-linear lookup operations with respect to the number of keys in the
> mapping. Logarithmic time or linear with respect to the length of the
> key should be achievable and would be sufficient for realistic
> applications.
> * '''Merkle compression of mapping structure.''' It should be possible
> to produce a reduced description of the tree consisting of a single
> root hash value which is deterministically calculated from the mapping
> structure.
> * '''Efficient proofs of inclusion.''' It should be possible to
> extract a proof of key/value mapping which is limited in size and
> verification time by the length of the key in the worst case.
> * '''Computation of updates using local information.''' Given a set of
> inclusion proofs, it should be possible to calculate adjustments to
> the local mapping structure (update or deletion of included mappings,
> or insertion between two included mappings which are adjacent in the
> global structure).
>
> Such applications include committed validation indices which enable
> stateless mining nodes, committed wallet indices which enable
> trust-less querying of the unspent transaction output set by
> <code>scriptPubKey</code>, efficient document time-stamping, and
> secure & efficient merged mining. This BIP describes an authenticated
> prefix tree which has the above properties, but leaves the myriad
> applications to be formalized in future BIPs.
>
> ==Data structure==
>
> This BIP defines a binary prefix tree. Such a structure provides a
> mapping of bitstrings (the ''keys'') to bytestrings (the ''values'').
> It is an acyclic binary tree which implicitly encodes keys within the
> traversal path -- a "left" branch is a 0, and a "right" branch is a 1.
> Each node is reachable by only one unique path, and reading off the
> branches taken (0 for each left, 1 for each right) as one follows the
> path from root to target yields the node's key.
>
> The particular binary prefix tree defined by this BIP is a hybrid
> PATRICIA / de la Brandais tree structure.
> [http://en.wikipedia.org/wiki/Radix_tree PATRICIA trees] compress a
> long sequence of non-branching nodes into a single interior node with
> a per-branch ''skip prefix''. This achieves significant savings in
> storage space, root hash calculation, and traversal time.
>
> A de la Brandais trie achieves compression by only storing branches
> actually taken in a node. The space savings are minimal for a binary
> tree, but place the serialized size of a non-branching interior node
> under the SHA-256 block size, thereby reducing the number of hash
> operations required to perform updates and validate proofs.
>
> This BIP describes the authenticated prefix tree and its many
> variations in terms of its serialized representation. Additional BIPs
> describe the application of authenticated prefix trees to such
> applications as committed indices, document time-stamping, and merged
> mining.
>
> ==Serialization format==
>
> As a hierarchical structure, the serialization of an entire tree is
> the serialization of its root node. A serialized node is the
> concatenation of five structures:
>
> node := flags || VARCHAR(extra) || value || left || right
>
> The <code>flags</code> is a single byte field whose composite values
> determine the bytes that follow.
>
> flags = (left_flags << 0) |
> (right_flags << 2) |
> (has_value << 4) |
> (prune_left << 5) |
> (prune_right << 6) |
> (prune_value << 7)
>
> The <code>left_flags</code> and <code>right_flags</code> are special
> 2-bit enumeration fields. A value of 0 indicates that the node does
> not branch in this direction, and the corresponding <code>left</code>
> or <code>right</code> branch is missing (replaced with the empty
> string in the node serialization). A value of 1 indicates a single bit
> key prefix for this branch, implicitly 0 for <code>left</code> and 1
> for <code>right</code>. A 2 indicates up to 7 bits of additional skip
> prefix (beyond the implicit first bit, making 8 bits total) are stored
> in a compact single-byte format. A 3 indicates a skip prefix with
> greater than 7 additional bits, stored length-prefix encoded.
>
> The single bit <code>has_value</code> indicates whether the node
> stores a data bytestring, the value associated with its key prefix.
> Since keys may be any value or length, including one key being a
> prefix of another, it is possible for interior nodes in addition to
> leaf nodes to have values associated with them, and therefore an
> explicit value-existence bit is required.
>
> The remaining three bits are used for proof extraction, and are masked
> away prior to hash operations. <code>prune_left</code> indicates that
> the entire left branch has been pruned. <code>prune_right</code> has
> similar meaning for the right branch. If <code>has_value</code> is
> set, <code>prune_value</code> may be set to exclude the node's value
> from encoded proof. This is necessary field for interior nodes, since
> it is possible that their values may be pruned while their children
> are not.
>
> The <code>value</code> field is only present if the bit
> <code>flags.has_value</code> is set, in which case it is a
> <code>VARCHAR</code> bytestring:
>
> switch flags.has_value:
> case 0:
> value := ε
> case 1:
> value := VARCHAR(node.value)
>
> The <code>extra</code> field is always present, and takes on a
> bytestring value defined by the particular application. Use of the
> <code>extra</code> field is application dependent, and will not be
> covered in this specification. It can be set to the empty bytestring
> (serialized as a single zero byte) if the application has no use for
> the <code>extra</code> field.
>
> value := VARCHAR(calculate_extra(node))
>
> The <code>left</code> and <code>right</code> non-terminals are only
> present if the corresponding <code>flags.left_flags</code> or
> <code>flags.right_flags</code> are non-zero. The format depends on the
> value of this flags setting:
>
> switch branch_flags:
> case 0:
> branch := ε
> case 1:
> branch := branch_node_or_hash
> case 2:
> prefix = prefix >> 1
> branch := int_to_byte(1 << len(prefix) | bits_to_int(prefix)) ||
> branch_node_or_hash
> case 3:
> prefix = prefix >> 1
> branch := VARINT(len(prefix) - 9) ||
> bits_to_string(prefix) ||
> branch_node_or_hash
>
> <code>branch_flags</code> is a stand-in meant to describe either
> <code>left_flags</code> or <code>right_flags</code>, and likewise
> everywhere else in the above pseudocode <code>branch</code> can be
> replaced with either <code>left</code> or <code>right</code>.
>
> <code>prefix</code> is the key bits between the current node and the
> next branching, terminal, and/or leaf node, including the implicit
> leading bit for the branch (0 for the left branch, 1 for the right
> branch). In the above code, <code>len(prefix)</code> returns the
> number of bits in the bitstring, and <code>prefix >> 1</code> drops
> the first bit reducing the size of the bitstring by one and
> renumbering the indices accordingly.
>
> The function <code>int_to_byte</code> takes an integer in the range
> [0, 255] and returns the octet representing that value. This is a NOP
> in many languages, but present in this pseudocode so as to be explicit
> about what is going on.
>
> The function <code>bits_to_int</code> interprets a sequence of bits as
> a little-endian integer value. This is analogous to the following
> pseudocode:
>
> def bits_to_int(bits):
> result = 0
> for idx in 1..len(bits):
> if bits[idx] == 1:
> result |= 1<<idx
>
> The function <code>bits_to_string</code> serializes a sequence of bits
> into a binary string. It uses little-endian bit and byte order, as
> demonstrated by the following pseudocode:
>
> def bits_to_string(bits):
> bytes = [0] * ceil(len(bits) / 8)
> for idx in 1..len(bits):
> if bits[idx] == 1:
> bytes[idx / 8] |= 1 << idx % 8
> return map(int_to_byte, bytes)
>
> <code>branch_node_or_hash</code> is either the serialized child node
> or its SHA-256 hash and associated meta-data. Context determines which
> value to use: during digest calculations, disk/database serialization,
> and when the branch is pruned the hash value is used and serialized in
> the same way as other SHA-256 values in the bitcoin protocol (note
> however that it is single-SHA-256, not the double-SHA-256 more
> commonly used in bitcoin). The number of terminal (value-containing)
> nodes and the serialized size in bytes of the fully unpruned branch
> are suffixed to the branch hash. When serializing a proof or
> snapshotting tree state and the branch is not pruned, the serialized
> child node is included directly and the count and size are omitted as
> they can be derived from the serialization.
>
> if branch_pruned or SER_HASH:
> branch_node_or_hash := SHA-256(branch) ||
> count(branch) ||
> size(branch)
> else:
> branch_node_or_hash := serialize(branch)
>
> As an example, here is the serialization of a prefix tree mapping the
> names men and women of science to the year of their greatest publication:
>
> >>> dict = AuthTree()
> >>> dict['Curie'] = VARINT(1898)
> >>> dict('Einstein') = VARINT(1905)
> >>> dict['Fleming'] = VARINT(1928)
> >>> dict['中本'] = VARINT(2009)
> >>> dict.serialize()
> # An bytestring, broken out into parts:
>
> # . Root node:
> 0x0e # left_flags: 2, right_flags: 3, has_value: 1
> 0x00 # extra: ε
>
> # .l Inner node: 0b01000
> 0x11 # 0b01000
> 0x07 # left_flags: 3, right_flags: 1
> 0x00 # extra: ε
>
> # .l.l Inner node: 0b01000011 0b01110101 0b01110010 0b01101001
> # 'C' 'u' 'r' 'i'
> # 0b01100101
> # 'e'
> 0x1abb3a599a02 # 0b01101110101011100100110100101100101
> 0x10 # has_value: 1
> 0x00 # extra: ε
> 0x03fd6a07 # value: VARINT(1911)
>
> # .l.r Inner node: 0b010001
> 0x0f # left_flags: 3, right_flags: 3
> 0x00 # extra: ε
>
> # .l.r.l Inner node: 0b01000101 0b01101001 0b01101110 0b01110011
> # 'E' 'i' 'n' 's'
> # 0b01110100 0b01100101 0b01101001 0b01101110
> # 't' 'e' 'i' 'n'
> 0x312ded9c5d4c2ded00 # 0b1011010010110111
> # 0b0011100110111010
> # 0b0011001010110100
> # 0b101101110
> 0x10 # has_value: 1
> 0x00 # extra: ε
> 0x03fd7107 # value: VARINT(1905)
>
> # .l.r.r Inner node: 0b01000110 0b01101100 0b01100101 0b01101101
> # 'F' 'l' 'e' 'm'
> # 0b01101001 0b01101110 0b01100111
> # 'i' 'n' 'g'
> 0x296c4c6d2dedcc01 # 0b0011011000110010
> # 0b1011011010110100
> # 0b10110111001100111
> 0x10 # has_value: 1
> 0x00 # extra: ε
> 0x03fd8807 # value: VARINT(1928)
>
> # .r Inner node: 0b11100100 0b10111000 0b10101101
> # '中'
> # 0b11100110 0b10011100 0b10101100
> # '本'
> 0x27938edab39c1a # 0b1100100101110001
> # 0b0101101111001101
> # 0b001110010101100
> 0x10 # has_value: 1
> 0x00 # extra: ε
> 0x03fdd907 # value: VARINT(2009)
>
> ==Hashing==
>
> There are two variations of the authenticated prefix tree presented in
> this draft BIP. They differ only in the way in which hash values of a
> node and its left/right branches are constructed. The variations,
> discussed below, tradeoff computational resources for the ability to
> compose operational proofs. Whether the performance hit is
> significant, and whether or not the added features are worth the
> tradeoff depends very much on the application.
>
> ===Variation 1: Level-compressed hashing===
>
> In this variation the referenced child node's hash is used in
> construction of an interior node's hash digest. The interior node is
> serialized just as described (using the child node's digest instead of
> inline serialization), the resulting bytestring is passed through one
> round of SHA-256, and the digest that comes out of that is the hash
> value of the node. This is very efficient to calculate, requiring the
> absolute minimum number of SHA-256 hash operations, and achieving
> level-compression of computational resources in addition to reduction
> of space usage.
>
> For example:
>
> >>> dict = AuthTree()
> >>> dict['a'] = 0xff
> >>> dict.serialize()
> 0x0200c3100001ff
> >>> dict.root
> AuthTreeNode(
> left_prefix = 0b01100001,
> left_hash =
> 0xbafa0e2bba3396c5e9804b6cbe61be82bc442c1121aed81f8d5de36e9b20dc2f,
> left_count = 1,
> left_size = 4)
> >>> dict.hash
> 0xb4837376022a7c9ddaa7d685ad183bcbd5d16c362b81fa293a7b9e911766cf3c
>
> Assuming uniform distribution of key values, level-compressed hashing
> has time-complexity logarithmic with respect to the number of keys in
> the prefix tree. The disadvantage is that it is not possible in
> general to "rebase" an operational proof on top of a sibling,
> particularly if that sibling deletes branches that result in
> reorganization and level compression of internal nodes used by the
> rebased proof.
>
> ===Variation 2: Proof-updatable hashing===
>
> In this variation, level-compressed branches are expanded into a
> series of chained single-branch internal nodes, each including the
> hash of its direct child. For a brach with a prefix N bits in length,
> this requires N chained hashes. Thanks to node-compression (excluding
> empty branches from the serialization), it is possible for each hash
> operation + padding to fit within a single SHA-256 block.
>
> Note that the serialization semantics are unchanged! The variation
> only changes the procedure for calculating the hash values of interior
> nodes. The serialization format remains the same (modulo differing
> hash values standing in for pruned branches).
>
> Using the above example, calling <code>dict.hash</code> causes the
> following internal nodes to be constructed:
>
> >>> node1 = AuthTreeNode(
> right_prefix = 0b1,
> right_hash =
> 0xbafa0e2bba3396c5e9804b6cbe61be82bc442c1121aed81f8d5de36e9b20dc2f,
> right_count = 1,
> right_size = 4)
> >>> node2 = AuthTreeNode( left_prefix=0b0, left_hash=node1.hash,
> left_count=1, left_size=4)
> >>> node3 = AuthTreeNode( left_prefix=0b0, left_hash=node2.hash,
> left_count=1, left_size=4)
> >>> node4 = AuthTreeNode( left_prefix=0b0, left_hash=node3.hash,
> left_count=1, left_size=4)
> >>> node5 = AuthTreeNode( left_prefix=0b0, left_hash=node4.hash,
> left_count=1, left_size=4)
> >>> node6 = AuthTreeNode(right_prefix=0b1, right_hash=node5.hash,
> right_count=1, right_size=4)
> >>> node7 = AuthTreeNode(right_prefix=0b1, right_hash=node6.hash,
> right_count=1, right_size=4)
> >>> node8 = AuthTreeNode( left_prefix=0b0, left_hash=node7.hash,
> left_count=1, left_size=4,
> value=0xff)
> >>> dict.hash == node8.hash
> True
> >>> dict.hash
> 0xc3a9328eff06662ed9ff8e82aa9cc094d05f70f0953828ea8c643c4679213895
>
> The advantage of proof-updatable hashing is that any operational proof
> may be "rebased" onto the tree resulting from a sibling proof, using
> only the information locally available in the proofs, even in the
> presence of deletion operations that result in level-compression of
> the serialized form. The disadvantage is performance: validating an
> updatable proof requires a number of hash operations lower-bounded by
> the length of the key in bits.
>
> ==Inclusion proofs==
>
> An inclusion proof is a prefix tree pruned to contain a subset of its
> keys. The serialization of an inclusion proof takes the following form:
>
> inclusion_proof := variant || root_hash || root_node || checksum
>
> Where <code>variant</code> is a single-byte value indicating the
> presence of level-compression (0 for proof-updatable hashing, 1 for
> level-compressed hashing). <code>root_hash</code> is the Merkle
> compression hash of the tree, the 32-byte SHA-256 hash of the root
> node. <code>tree</code> is the possibly pruned, serialized
> representation of the tree. And finally, <code>checksum</code> is the
> first 4 bytes of the SHA-256 checksum of <code>variant</code>,
> <code>root_hash</code>, and <code>root_node</code>.
>
> For ease of transport, the standard envelope for display of an
> inclusion proof is internet-standard base64 encoding in the following
> format:
>
> - -----BEGIN INCLUSION PROOF-----
> ATzPZheRnns6KfqBKzZs0dXLOxithdan2p18KgJ2c4O0DgARBwAauzpZmgIQAAP9agcPADEt7Zxd
> TC3tABAAA/1xBylsTG0t7cwBEAAD/YgHJ5OO2rOcGhAAA/3ZByEg+2g=
> - -----END INCLUSION PROOF-----
>
> Decoded, it looks like this:
>
> 0x01 # Level-compressed hashing
> # Merkle root:
> 0x3ccf6617919e7b3a29fa812b366cd1d5cb3b18ad85d6a7da9d7c2a02767383b4
> # Serialized tree (unpruned):
> 0x0e001107001abb3a599a02100003fd6a070f00312ded9c5d4c2ded00100003fd
> 0x7107296c4c6d2dedcc01100003fd880727938edab39c1a100003fdd907
> # Checksum:
> 0x2120fb68
>
> ==Operational proofs==
>
> An operational proof is a list of insert/update and delete operations
> suffixed to an inclusion proof which contains the pathways necessary
> to perform the specified operations. The inclusion proof must contain
> the key values to be updated or deleted, and the nearest adjacent key
> values for each insertion. The serialization of an operational proof
> takes the following form:
>
> operational_proof := variant || root_hash || tree ||
> VARLIST(delete*) || VARLIST(update*) ||
> new_hash || checksum
>
> delete := VARCHAR(key)
> update := VARCHAR(key) || VARCHAR(value)
>
> The first three fields, <code>variant</code>, <code>root_hash</code>,
> and <code>tree</code> are the inclusion proof, and take the same
> values described in the previous section. <code>deletes</code> is a
> list of key values to be deleted; each key value in this list must
> exist in the inclusion proof. <code>updates</code> is a list of key,
> value mappings which are to be inserted into the tree, possibly
> replacing any mapping for the key which already exists; either the key
> itself if it exists (update), or the two lexicographically closest
> keys on either side if it does not (insert) must be present in the
> insertion proof. <code>new_hash</code> is the resulting Merkle root
> after the insertion, updates, and deletes are performed, and
> <code>checksum</code> is the initial 4 bytes of the SHA-256 hash of
> the preceding fields.
>
> Just like inclusion proofs, an operational proof is encoded in base64
> for display and transport. Here's the same
>
> - -----BEGIN OPERATIONAL PROOF-----
> ATzPZheRnns6KfqBKzZs0dXLOxithdan2p18KgJ2c4O0LgARaIsVaQi/GdhOPOgA8p4Pu4PiEfEg
> lcmy3j7bOc7hXw0DLSeTjtqznBoQAAP92QcBMOS4reacrACzuZJbyP7fqIOf5VEk4iarG4+uPoZC
> oun8BztQMQBy0LHVeSY=
> - -----END OPERATIONAL PROOF-----
>
> Decoded and broken into its constituent fields:
>
> 0x01 # Level-compressed hashing
> # Original Merkle root:
> 0x3ccf6617919e7b3a29fa812b366cd1d5cb3b18ad85d6a7da9d7c2a02767383b4
> # Serialized tree (included keys: '中本'):
> 0x2e0011688b156908bf19d84e3ce800f29e0fbb83e211f12095c9b2de3edb39ce
> 0xe15f0d032d27938edab39c1a100003fdd907
> # Deletion list ['中本']:
> 0x01
> 0x30e4b8ade69cac
> # Insertion list []:
> 0x00
> # New Merkle root:
> 0xb3b9925bc8fedfa8839fe55124e226ab1b8fae3e8642a2e9fc073b50310072d0
> # Checksum:
> 0xb1d57926
>
> ~End of File~
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.14 (GNU/Linux)
> Comment: GPGTools - http://gpgtools.org
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iQIcBAEBAgAGBQJSs6HIAAoJEAdzVfsmodw4gooQAJm7XNsZjgdeTSpKIvUIU38f
> tQx2FD08hQdLl48me5mDUbHJgGlYINsKAgoZ8Mqwi/kHEEYhuIlLIX1p6Ovigidb
> 21BiVoOLdG1egGOwxp17DuwYaDPTppFTlN9TBjZzW6WKc7+4aNvyc1KtrbHIhtj/
> 04ekFyAn4U5UH0ht7CI79j0u3Kp85p5D4PyYZB2m82mzti6OxpSM4tXlMkDW7ihg
> QJwiZSjzejqTd7WF0zr0SLeGVRSN2A0dzUCoVsI98eIa3hkw2N4ae6dRkibyStOT
> V8VEDvHArEDlvu8jiryajhsom5mvtOOclNDkVXWAf/Te4gj05iYdTIvNvDEJtqsP
> XDbmw6GgV1kBLlLo0mp//t/+wr+nIvy+sVAP+eqtM/0vjaVXBkXxkUMqqNkrtVpB
> f3whq7nFahssUMSoWE93jgob1ayAax2XUALVMAXYsJl7b2MqBGlhiTZ8FQZ+TW4A
> tIpKeUprPmDvA18rO3SCbmLMQryZqYiH0sRyvUc5kvn3qCRHrISZNkEuK591eS+x
> BO1eOluPzVqeXPPSK1jvGeY0FNJtwzbov4nI1mzOvzQHLCvkHn5PhUFCK5tL5tAe
> b0Z5qwDV+SvVs7W1R7ejYBzEj77U1zuzZ9AtikOuvy+bNGrkIlpI49EyXHijm7C3
> Q6JacTuI0PelYji2gaBJ
> =BbDs
> -----END PGP SIGNATURE-----
>
> ------------------------------------------------------------------------------
> Rapidly troubleshoot problems before they affect your business. Most IT
> organizations don't have a clear picture of how application performance
> affects their revenue. With AppDynamics, you get 100% visibility into your
> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
> http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
> _______________________________________________
> Bitcoin-development mailing list
> Bitcoin-development at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/bitcoin-development