2024-07-11 13:08:33

Fabio Manganiello on Nostr:

should probably elaborate a bit more, since there's no silver bullet in databases and all pros and cons come with ifs and buts.

Concurrency is a problem mostly if no constraints other than the primary key are enforced. If an individual is supposed to be uniquely identified by their SSN, but there's no unique constraint on that column, then a record like `User(ssn=1234)` published both on `stream1` and `stream2`, and consumed by `db_service1` and `db_service2` respectively, can easily result in data pollution / duplication.
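
For illustration, a minimal sketch of that failure mode with an in-memory SQLite table (the `users` schema and column names are made up): with only a synthetic primary key both inserts land, while a `UNIQUE` constraint on `ssn` rejects the second one.

```python
# Hypothetical "users" table with only a synthetic primary key: both consumers
# insert the same person and the db keeps both rows.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, ssn TEXT)")

# db_service1 (stream1) and db_service2 (stream2) both receive User(ssn=1234)
db.execute("INSERT INTO users (ssn) VALUES ('1234')")
db.execute("INSERT INTO users (ssn) VALUES ('1234')")
print(db.execute("SELECT COUNT(*) FROM users WHERE ssn = '1234'").fetchone())  # (2,)

# With a UNIQUE constraint on ssn, the second insert fails instead of duplicating
db.execute("CREATE TABLE users_unique (id INTEGER PRIMARY KEY, ssn TEXT UNIQUE)")
db.execute("INSERT INTO users_unique (ssn) VALUES ('1234')")
try:
    db.execute("INSERT INTO users_unique (ssn) VALUES ('1234')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```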

The same goes for the upsert issue - if I get `Employee(ssn=1234, job=engineer)`, and later `Employee(ssn=1234, job=manager)`, and no constraints other than the synthetic primary key are enforced, how do I know that the second record is an update of the first rather than a brand new employee?

The obvious solution would be to have a `UNIQUE` constraint on `ssn` besides the one on the synthetic primary key, and then an `INSERT INTO...ON CONFLICT UPDATE` or equivalent logic to ensure consistency/causality/idempotency. But then such a constraint would functionally behave like a primary key (in E/R theory it would be called an "alternate key"), with the same issues described in the article (i.e. the burden of proving uniqueness) but without the benefits of a real primary key (i.e. free indexing and efficient joins).
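
A minimal sketch of that alternate-key approach, again with SQLite (the `employees` schema is hypothetical; Postgres has the analogous `INSERT ... ON CONFLICT ... DO UPDATE`):

```python
# Hypothetical "employees" table: synthetic primary key plus a UNIQUE alternate
# key on ssn, and an upsert so a later record for the same ssn becomes an update.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE employees (
        id  INTEGER PRIMARY KEY,   -- synthetic key
        ssn TEXT NOT NULL UNIQUE,  -- alternate key carrying the uniqueness burden
        job TEXT NOT NULL
    )
""")

upsert = """
    INSERT INTO employees (ssn, job) VALUES (?, ?)
    ON CONFLICT (ssn) DO UPDATE SET job = excluded.job
"""
db.execute(upsert, ("1234", "engineer"))
db.execute(upsert, ("1234", "manager"))  # same person, later event: update, not a new row

print(db.execute("SELECT ssn, job FROM employees").fetchall())  # [('1234', 'manager')]
```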

Another solution would be to let the application preemptively fetch potential duplicates before the insert/update (e.g. by querying any employees with an SSN in a certain list) and then perform the insert/update accordingly. But this moves storage logic into the business logic, with all the attached downsides - you need to lock the records or the table while you do a read+write, probably followed by another read to get the IDs, and that can cause serious performance degradation on write-heavy systems.
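
A rough sketch of that application-side pattern, assuming SQLite where `BEGIN IMMEDIATE` holds the write lock for the whole read+write (on e.g. Postgres you'd lock the candidate rows with `SELECT ... FOR UPDATE` instead):

```python
# Read-then-write in the application, wrapped in a single write transaction so
# no other writer can sneak in between the SELECT and the INSERT/UPDATE.
import sqlite3

db = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions manually
db.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, ssn TEXT, job TEXT)")

def save_employee(ssn: str, job: str) -> None:
    db.execute("BEGIN IMMEDIATE")  # take the write lock for the whole read+write
    try:
        row = db.execute("SELECT id FROM employees WHERE ssn = ?", (ssn,)).fetchone()
        if row:
            db.execute("UPDATE employees SET job = ? WHERE id = ?", (job, row[0]))
        else:
            db.execute("INSERT INTO employees (ssn, job) VALUES (?, ?)", (ssn, job))
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise

save_employee("1234", "engineer")
save_employee("1234", "manager")  # found by the lookup, so it becomes an UPDATE
```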

The insert+fetch issue instead happens if you have primary keys as auto-increment integers or random UUIDs defined as db column defaults. In that case, you'll have to do a SELECT after your INSERT to get the inserted ID (again, performance overhead). The mitigation in this case is usually to generate the ID in the application logic rather than on the db. But then we come back to the concurrency issue - having a centralized mechanism for generating IDs has its benefits, and removing that often means having to create an external central service that does the job that the db should do.
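
A minimal sketch of client-generated IDs (hypothetical `employees` table): the application mints the key before the `INSERT`, so there's nothing to fetch back afterwards - at the cost of giving up the db as the central ID authority.

```python
# The application mints the primary key itself, so the INSERT needs no follow-up
# SELECT to discover the generated ID.
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (id TEXT PRIMARY KEY, ssn TEXT, job TEXT)")

new_id = str(uuid.uuid4())  # generated in application logic, not by the db
db.execute("INSERT INTO employees (id, ssn, job) VALUES (?, ?, ?)",
           (new_id, "1234", "engineer"))
print(new_id)  # already known to the caller, no extra round trip needed
```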

And sharding becomes an issue only if the primary key is completely random. Keys generated with e.g. UUIDv7 mitigate that issue, as UUIDv7 is time-sortable and shards well, but that means reducing all the possible sharding dimensions to time alone - which is the same thing you would get with an auto-increment.
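
To make the time-prefix point concrete, here's a hand-rolled stand-in for a UUIDv7 generator (just illustrating the RFC 9562 layout, not a production implementation): the first 48 bits are a millisecond timestamp, so key order - and therefore any range-based sharding - collapses to time, much like an auto-increment.

```python
# Minimal UUIDv7-style generator: 48-bit Unix ms timestamp, version and variant
# bits, then random bits (simplified from the RFC 9562 layout).
import os
import time
import uuid

def uuid7_like() -> uuid.UUID:
    ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)                # 48-bit timestamp
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (ts_ms << 80) | (0x7 << 76) | (rand_a << 64) | (0x2 << 62) | rand_b
    return uuid.UUID(int=value)

keys = [uuid7_like() for _ in range(3)]
print(keys)  # all share the same leading timestamp bits: key order is time order
```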

All these trade-offs are the reason why I usually prefer deterministic primary keys. Which doesn't necessarily mean natural keys. Most of the time I actually use UUID keys, but those UUIDs are often hashes of the columns that I know to be "immutable" - calculated through flexible-length hashing algorithms like e.g. FNV-1a. If there aren't enough columns in the table to guarantee uniqueness, I usually synthesize extra components of my own (e.g. event time, creation date, namespace...) until I'm reasonably confident that the chances of conflict are low.

Sure, if any of the "immutable" columns turns out to be mutable (e.g. a CF/CPR that changes after a change of name/gender), then the application has to deal with the creation of a new record + rewiring + invalidation of the old one. But in my experience this is a rare exception, not the norm. It's better to have migration procedures in place to handle these cases, and keep all the benefits that come from deterministic keys and atomic transactions, than to do it the other way around.
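
As a concrete sketch of that kind of deterministic key, assume a hypothetical employee table whose "immutable" columns are the SSN and the date of birth, hashed with 128-bit FNV-1a together with a namespace and poured into a UUID-shaped container:

```python
# Deterministic key: 128-bit FNV-1a over a namespace plus the columns assumed to
# be immutable.  Field choice and separator are hypothetical; the result is used
# purely as a 128-bit identifier, not as a version-compliant RFC UUID.
import uuid

FNV128_PRIME = (1 << 88) + 0x13B
FNV128_OFFSET = 0x6C62272E07BB014262B821756295C58D

def fnv1a_128(data: bytes) -> int:
    h = FNV128_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV128_PRIME) & ((1 << 128) - 1)
    return h

def employee_key(ssn: str, birth_date: str, namespace: str = "employee") -> uuid.UUID:
    canonical = "\x1f".join((namespace, ssn, birth_date)).encode("utf-8")
    return uuid.UUID(int=fnv1a_128(canonical))

# Same inputs always yield the same key, on any node, with no coordination.
print(employee_key("1234", "1980-01-01"))
```
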
Author Public Key
npub1s9uc08n58mxqk5umvapqulwzng0sja635q86r36d8n4rr9r9ygaskjdnmu