đź“… Original date posted:2016-05-09
đź“ť Original message:Hi Pieter,
> I tried to derive what length of short ids is actually necessary (some
> write-up is on
> https://gist.github.com/sipa/b2eb2e486156b5509ac711edd16153ed but it's
> incomplete).
>
> For any reasonable numbers I can come up with (in a very wide range),
> the number of bits needed is very well approximated by:
>
> log2(#receiver_mempool_txn * #block_txn_not_in_receiver_mempool /
> acceptable_per_block_failure_rate)
>
> For example, with 20000 mempool transactions, 2500 transactions in a
> block, 95% hitrate, and a chance of 1 in 10000 blocks to fail to
> reconstruct, needed_bits = log2(20000 * 2500 * (1 - 0.95) / 0.0001) =
> 34.54, or 5 byte txids would suffice.
>
> Note that 1 in 10000 failures may sound like a lot, but this is for each
> individual connection, and since every transmission uses separately
> salted identifiers, occasional failures should not affect global
> propagation. Given that transmission failures due to timeouts, network
> connectivity, ... already occur much more frequently than once every few
> gigabytes (what 10000 blocks corresponds to), that's probably already
> more than enough.
>
> In short: I believe 5 or 6 byte txids should be enough, but perhaps it
> makes sense to allow the sender to choose (so he can weigh trying
> multiple nonces against increasing the short txid length).
[9 May 16 @ 11am PDT]
We worked on this with respect to “Xthin" for Bitcoin Unlimited, and came to a similar conclusion.
But we (I think it was theZerg) also noticed another trick: if the node receiving the thin blocks has a small number of collisions with transactions in its mempool (e.g., 1 or 2), then it can test each possible block against the Merkle root in the block header to determine the correct one. Using this technique, it should be possible to further reduce the number of bytes used for the txids. That being said, even thin blocks built from 64-bit short IDs represent a tremendous savings compared to standard block propagation. So we (Bitcoin Unlimited) decided not to pursue this optimization any further at that time.
***
It’s also interesting to ask what the information-theoretic minimum amount of information necessary for a node to re-construct a block is. The way I’m thinking about this currently[1] is that the node needs all of the transactions in the block that were not initially part of its mempool, plus enough information to select and ordered subset from that mempool that represents the block. If m is the number of transactions in mempool and n is the number of transactions in the block, then the number of possible subsets (C') is given by the binomial coefficient:
C' = m! / [n! (m - n)!]
Since there are n! possible orderings for each subset, the total number of possible blocks (C) of size n from a mempool of size m is
C = n! C’ = m! / (m-n)!
Assuming that all possible blocks are equally likely, the Shannon entropy (the information that must be communicated) is the base-2 logarithm of the number of possible blocks. After making some approximations, this works out very close to
minimum information ~= n * log2(m),
which for your case of 20,000 transactions in mempool (m = 20,000) and a 2500-transaction block (n = 2500), yields
minimum information = 2500 * log2(20,000) ~ 2500 * 15 bits.
In other words, a lower bound on the information required is about 2 bytes per transactions for every transaction in the block that the node is already aware of, as well as all the missing transactions in full.
Of course, this assumes an unlimited number of round trips, and it is probably complicated by other factors that I haven’t considered (queue the “spherical cow” jokes :), but I thought it was interesting that a technique like Xthin or compact blocks is already pretty close to this limit.
Cheers,
Peter
[1] There are still some things that I can’t wrap my mind around that I’d love to discuss with another math geek :)