Doug Hoyte on Nostr: Got it, makes sense. I think if you make sure the event is written before updating ...
Got it, makes sense. I think if you make sure the event is written before updating the DB then that would be safe, as you describe. It still might be worthwhile to check whether there is any performance difference between the parallel file and storing the validated event JSON in the sled DB along with the indices. If it works like LMDB then retrieving it will be functionally identical (an offset into an mmap), and an event write will require only one fsync, not two. It also might save you a lot of trouble later on, when it comes to the backups and periodic rewrites you mention.
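As a rough illustration of the ordering described above, here is a minimal Python sketch. It is not taken from either project: `append_event`, `read_event`, and the plain dict standing in for the database are all hypothetical names used only for this example. The event bytes are appended and fsynced first, and only afterwards is the location published in the index.

```python
import json
import os


def append_event(log_path, index, event):
    """Append an event to the log file, fsync it, then update the index.

    `index` is a plain dict standing in for the real database (an assumption
    for this sketch). It maps event id -> (offset, length) in the log file.
    """
    line = (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")
    with open(log_path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)  # current end of file
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # first durability point: event bytes are on disk
    # Only now is the location published. With a real database, committing
    # this update would be the second fsync mentioned in the post.
    index[event["id"]] = (offset, len(line))
    return offset


def read_event(log_path, index, event_id):
    """Fetch an event by seeking to the offset recorded in the index."""
    offset, length = index[event_id]
    with open(log_path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length))
```

If the process dies between the fsync and the index update, the log holds an event the index does not know about, which is harmless; the reverse order could leave the index pointing at bytes that never reached disk.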
About the indices: No, I don't think it's a dumb algorithm. It's actually a really difficult problem to figure out which index or indices to use in a scan. strfry has all the indices you mention, each also composite with created_at. Additionally, there is a (pubkey,kind,created_at) index, which I think is useful because I believe queries like "get contact list of user X" are quite common.
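To show why a (pubkey,kind,created_at) key suits a "get contact list of user X" query, here is a small sketch using a sorted in-memory list. The `CompositeIndex` class and its methods are hypothetical and say nothing about how strfry lays out its keys; the only outside fact assumed is that contact lists are kind 3 events.

```python
import bisect


class CompositeIndex:
    """Toy ordered index keyed by (pubkey, kind, created_at, event_id)."""

    def __init__(self):
        self._keys = []  # kept sorted at all times

    def add(self, pubkey, kind, created_at, event_id):
        bisect.insort(self._keys, (pubkey, kind, created_at, event_id))

    def latest(self, pubkey, kind, limit=1):
        """Return up to `limit` event ids for (pubkey, kind), newest first."""
        # Position just past the last key that has this (pubkey, kind) prefix.
        hi = bisect.bisect_right(self._keys, (pubkey, kind, float("inf")))
        out = []
        i = hi - 1
        while i >= 0 and len(out) < limit:
            key = self._keys[i]
            if key[0] != pubkey or key[1] != kind:
                break  # walked out of the prefix range
            out.append(key[3])
            i -= 1
        return out


idx = CompositeIndex()
idx.add("alice", 3, 100, "old-contacts")
idx.add("alice", 3, 200, "new-contacts")
idx.add("alice", 1, 300, "some-note")
print(idx.latest("alice", 3))  # kind 3 = contact list
```

One seek lands at the end of the (pubkey, kind) range, and the newest matching event is the first key read, with no filtering on kind needed afterwards.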
strfry currently never joins two indices together or does a set intersection as you describe -- its algorithm is even dumber. It only ever uses one index. If the query can be fully satisfied by the index then it doesn't need to load the flatbuffers (most queries are like this). But if it does need to load the flatbuffers, this is when it's important that accessing them and applying filters is fast. Optimisations for this code path are also beneficial for the live event filtering (i.e. post-EOSE).
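The single-index strategy can be sketched roughly as follows. This is an illustration of the idea only, not strfry's code: the row layout, `run_query`, and the `load_event` callback are assumptions made for the example. Fields present in the index row are checked directly, and the full event is loaded only when the filter contains a tag condition the index cannot answer.

```python
def matches_index_fields(row, flt):
    """Check the parts of a filter that the index row alone can answer."""
    pubkey, kind, created_at, _event_id = row
    if "authors" in flt and pubkey not in flt["authors"]:
        return False
    if "kinds" in flt and kind not in flt["kinds"]:
        return False
    if "since" in flt and created_at < flt["since"]:
        return False
    if "until" in flt and created_at > flt["until"]:
        return False
    return True


def matches_tags(event, flt):
    """Check tag conditions such as {"#e": [...]}, which need the full event."""
    for key, wanted in flt.items():
        if key.startswith("#"):
            have = {t[1] for t in event["tags"] if len(t) >= 2 and t[0] == key[1:]}
            if not have & set(wanted):
                return False
    return True


def run_query(rows, load_event, flt):
    """Scan ONE index (rows are newest first) and post-filter if required.

    `load_event(event_id)` stands in for fetching the stored event body.
    """
    needs_load = any(k.startswith("#") for k in flt)
    limit = flt.get("limit")
    out = []
    for row in rows:
        if not matches_index_fields(row, flt):
            continue
        if needs_load and not matches_tags(load_event(row[3]), flt):
            continue
        out.append(row[3])
        if limit is not None and len(out) >= limit:
            break
    return out
```

The same `matches_index_fields` and `matches_tags` checks could be applied to each newly arriving event for live subscriptions, which is why speeding up this path helps both cases.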
Your approach sounds like it might be quite effective. The only thing I'd watch out for is whether your IO patterns are excessively random. The nice thing about working on one index at a time is that the access is largely contiguous data. Even several contiguous streams are probably OK, but some queries have hundreds of pubkeys in them, for instance.
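To make the "several contiguous streams" case concrete, here is a hedged sketch that treats each pubkey as its own newest-first stream and merges them lazily. The `merged_scan` function and the dict-of-lists index are hypothetical stand-ins, not either project's data structure.

```python
import heapq
from itertools import islice


def merged_scan(index, pubkeys, limit):
    """Merge one newest-first stream per pubkey into a single result list.

    `index` maps pubkey -> list of (created_at, event_id), sorted newest first.
    Each list plays the role of one contiguous range in an on-disk index.
    """
    streams = [index.get(pk, []) for pk in pubkeys]
    # heapq.merge is lazy: it only pulls from each stream as far as needed.
    merged = heapq.merge(*streams, key=lambda row: row[0], reverse=True)
    return [event_id for _created_at, event_id in islice(merged, limit)]


index = {
    "a": [(30, "a3"), (10, "a1")],
    "b": [(20, "b2"), (5, "b0")],
}
print(merged_scan(index, ["a", "b"], 3))
```

Each stream is read sequentially, but with hundreds of pubkeys the merge keeps hundreds of read positions open at once, which is where the access pattern starts to look random.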
Published at 2023-05-08 16:19:54
Event JSON
{
"id": "20f8e20cf6a712c603cc0e9e27f9425ae6b153ad79dd35376a84c5db6dd4d57f",
"pubkey": "218238431393959d6c8617a3bd899303a96609b44a644e973891038a7de8622d",
"created_at": 1683562794,
"kind": 1,
"tags": [
[
"e",
"f70d6db9005ce00a8abf314caa16463c8b1b0b2cc1fe8ecebc144d1823a0bd74",
"",
"root"
],
[
"e",
"000008a19876f86e894f34f94d8b54e38383a54799f8b707f9b0ed7c5495c108",
"",
"reply"
],
[
"p",
"218238431393959d6c8617a3bd899303a96609b44a644e973891038a7de8622d"
],
[
"p",
"0c99877612291bd818b3dd92f2852b823557b3744c3cb10470865c7a56a4929b"
],
[
"p",
"79c2cae114ea28a981e7559b4fe7854a473521a8d22a66bbab9fa248eb820ff6"
],
[
"p",
"2d5b6404df532de082d9e77f7f4257a6f43fb79bb9de8dd3ac7df5e6d4b500b0"
],
[
"p",
"1bc70a0148b3f316da33fe3c89f23e3e71ac4ff998027ec712b905cd24f6a411"
],
[
"p",
"ee11a5dff40c19a555f41fe42b48f00e618c91225622ae37b6c2bb67b76c4e49"
]
],
"content": "Got it, makes sense. I think if you make sure the event is written before updating the DB then that would be safe, as you describe. It still might be worthwhile to check if there is any performance difference between the parallel file and storing the validated event JSON in sled DB along with the indices. If it works like LMDB then retrieving it will be functionally identical (an offset into an mmap), and an event write will require only one fsync, not two. It also might save you a lot of trouble later on, when it comes to backups and periodic rewrites you mention.\n\nAbout the indices: No, I don't think it's a dumb algorithm. It's actually a really difficult problem, to figure out which index or indices to use in a scan. strfry has all the indices you mention, also composite with created_at. Additionally, there is a (pubkey,kind,created_at) index, which I think is useful because I believe queries like \"get contact list of user X\" are quite common.\n\nstrfry currently never joins two indices together or does a set intersection as you describe -- its algorithm is even dumber. It only ever uses 1 index. If the query can be fully satisfied by the index then it doesn't need to load the flatbuffers (most queries are like this). But if it does need to load the flatbuffers this is when it's important that accessing them and applying filters are fast. Optimisations for this code path are also beneficial for the live events filtering (ie post EOSE).\n\nYour approach sounds like it might be quite effective. The only thing I'd watch out for is if your IO patterns are excessively random. The nice thing about working on one index at once is that the access is largely contiguous data. Even several contiguous streams is probably OK, but some queries have like hundreds of pubkeys in them (for instance).",
"sig": "747c70caa9683df0443554d8089efb60fe43823227d46d6e171858d741cae2bac29123aab5701af3ad021b56f040f5a9e666b0f45b1a1214fb6ad1ee8752e02a"
}