Doug Hoyte on Nostr:
> If you’re sending binary, base64 only inflates by 33% compared to hex strings’ 100%.
On purely random data, after compression hex is usually only ~10% bigger than base64. For example:
$ head -c 1000000 /dev/urandom > rand
$ alias hex='od -A n -t x1 | sed "s/ *//g"'
$ cat rand |hex|zstd -c|wc -c
1086970
$ cat rand |base64|zstd -c|wc -c
1018226
$ wcalc 1086970/1018226
= 1.06751
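For reference, the uncompressed sizes behind the quoted 33%/100% figures can be checked the same way, using the same rand file and hex alias (both tools wrap their output lines, so the measured ratios land slightly above the theoretical 4/3 and 2):

$ cat rand |base64|wc -c
$ cat rand |hex|wc -c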
So only ~7% bigger in this case. When data is not purely random, hex often compresses *better* than base64. This is because hex preserves patterns on byte boundaries but base64 does not. For example, look at these two strings post-base64:
$ echo 'hello world' | base64
aGVsbG8gd29ybGQK
$ echo ' hello world' | base64
IGhlbGxvIHdvcmxkCg==
They have nothing in common. Compare to the hex-encoded versions:
$ echo 'hello world' | hex
68656c6c6f20776f726c640a
$ echo ' hello world' | hex
2068656c6c6f20776f726c640a
The pattern is preserved; it is just shifted by 2 characters. This means that if "hello world" appears multiple times in the input at arbitrary offsets, base64 can produce up to three different character patterns for it (one for each byte offset mod 3), but hex always produces the same one, so hex makes more effective use of the compressor's dictionary.
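As a rough way to see this on the wire (a sketch, not a careful benchmark: it assumes bash, the hex alias from above, and zstd defaults, and the exact numbers will vary), build a file where the same string recurs at arbitrary byte offsets and compare the compressed sizes of the two encodings:

$ for i in $(seq 10000); do head -c $((RANDOM % 30 + 1)) /dev/urandom; printf 'hello world'; done > repeated
$ cat repeated |hex|zstd -c|wc -c
$ cat repeated |base64|zstd -c|wc -c

If the byte-boundary argument holds, hex should fare relatively better here than on purely random data, since every occurrence of "hello world" hex-encodes to the same 22 characters regardless of offset.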
Since negentropy is mostly (but not entirely) random data like hashes and fingerprints, it's probably a wash. However, hex is typically faster to encode/decode, and it is already used for almost all other fields in the nostr protocol, so on the whole it seems like the best choice.
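The speed claim is easy to spot-check, with the caveat that command-line tools are only a rough proxy for the library code relays and clients actually use (and the od|sed pipeline above is especially unrepresentative). Assuming xxd and coreutils base64 are available:

$ xxd -p rand > rand.hex
$ base64 rand > rand.b64
$ time (xxd -p rand > /dev/null)
$ time (base64 rand > /dev/null)
$ time (xxd -r -p rand.hex > /dev/null)
$ time (base64 -d rand.b64 > /dev/null)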
> Personally, I’d prefer to see the message format specified explicitly as debuggable JSON, if feasible.
This is theoretically possible, but the messages would still be very difficult to interpret/debug, and it would add a lot of bandwidth/CPU overhead.
Published at 2024-09-13 21:23:19