Sometimes I stumble upon this kind of post on Twitter and wonder how it would scale on Nostr: 9k comments, 29k reposts, 422k likes. Each of these actions is an event on Nostr. Likes can be quite small (say less than a kB each), but reposts carry the whole original event, and comments vary in size, though we can assume around 2 kB per event. Nostr also fetches the same event many times if outbox is implemented. So this post on Nostr would require ~0.75 GB of metadata even if fetched from a single relay. I don't have a good answer to this scaling problem, but if we are lucky, we'll need to think hard about a solution here.

Replies (30)

The events aren't all posted to the recipient's read relays. I would imagine that by the time we get to that level of engagement, users would have their relays set up correctly.
489 × 29,000 + 489 × 422,000 + 2,000 × 9,000 ≈ 240 MB.

If we move to binary encoding (assuming the 2 kB kind 1111 estimate stays the same, because it was a guess anyway) we drop almost 100 MB, down to ~150 MB.

But for simple like counts, like what Twitter's UI returns, you can run a COUNT query, which returns about 1 kB :) Run one for likes and one for quotes, that's 2 kB, and then the 9k comments fit neatly in 18 MB.
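The same back-of-envelope arithmetic as a runnable snippet (the per-event sizes are the guesses from this thread, not measurements):

```go
package main

import "fmt"

func main() {
	// Assumed average event sizes in bytes; guesses from the thread, not measured values.
	const likeSize, repostSize, commentSize = 489, 489, 2000

	// Engagement numbers from the screenshot in the original post.
	const likes, reposts, comments = 422_000, 29_000, 9_000

	total := likes*likeSize + reposts*repostSize + comments*commentSize
	fmt.Printf("total ≈ %.0f MB\n", float64(total)/1e6) // ≈ 239 MB
}
```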
Yeah we really just need to query a relay for counts. And relays should generally implement caching, which is used to very good effect and lowers bandwidth when using an aggregator relay.
Yes, the count query seems like the obvious fix here. I just haven't seen it implemented in any client yet, and I have no idea if any relay even supports it. A count query has its limitations too: you have to blindly trust the relay, and you have to pick only one, because you can't deduplicate two different count responses. You obviously can't verify event signatures either. But I agree the count query will become mandatory at some point. It just seems like this kind of query is not a relay job but more a @Vertex one, just like follow counts.
There is NIP-45 for counts, but it's not widely supported, and it's quite cumbersome for relays as it probably requires a separate tags table and some indices as well. I know because I've built my own. Anyway, I kinda agree: adding too many features to relays erodes one of the best features of nostr, simplicity.
Nice, will have a look at Nastro and NIP-45. I wonder if there is any kind of cryptographic proof involved in a count response; I doubt it.
There is NIP-45, which is used to count how many events match a certain filter. So the proof would ideally show that X (the returned count) events with valid signatures actually matched the filter(s).
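For reference, the NIP-45 exchange is just another websocket frame: the client sends ["COUNT", <subscription_id>, <filters...>] and the relay replies ["COUNT", <subscription_id>, {"count": <n>}]. A minimal Go sketch building the request for the like count of a single note (the event id is a placeholder and the websocket plumbing is omitted):

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Hypothetical note we want the like count for.
	noteID := "replace-with-a-real-event-id"

	// NIP-01 filter: kind 7 reactions that tag this event.
	filter := map[string]any{
		"kinds": []int{7},
		"#e":    []string{noteID},
	}

	// NIP-45 request frame: ["COUNT", <subscription_id>, <filter>]
	req, _ := json.Marshal([]any{"COUNT", "likes-count", filter})
	fmt.Println(string(req))

	// The relay is expected to reply with something like:
	// ["COUNT", "likes-count", {"count": 422000}]
}
```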
i must make a note to implement count on #orly - the database is designed to avoid needing to decode events before it is certain they are result candidates, so it can actually do this, at least for author/kind/timestamp filters, without decoding a single event stored in the database. probably i need to add a new index that covers the gap with tags, like pubkey/tag/timestamp, so the count function never has to decode any events. my main objection to count was the excessive work required, which was basically the same as if you just fetched the events and did the processing locally. but with the necessary indexes that doesn't have to happen. trading off database size for the speed of search and count is what i'm talking about. there are definitely algorithms that would run a lot faster if they could do counts.
> my main objection to count was the excessive work required, which was basically the same as if you just fetched the events and did the processing locally

For a client these aren't even comparable: 100k events + signature checks vs 1 COUNT.
yes but if the relay has to decode all 100k of those events, there's not really any gain. count searches need to be as efficient as possible because of extreme numbers like 100k. you simply wouldn't even request that much if you wanted the whole thing so it's mad if the relay's database literally has to parse the whole damn thing, just for a headcount.
yeah, i built my DB from scratch using a key/value store, so such a search index has to be added. part of it is there already, but it doesn't cover the full filter query space, only pubkey/id/timestamp; it also needs tags and kinds. the extra index eliminates the need to decode events by letting queries search by those criteria directly instead of sifting through inordinately large volumes of data. i've made a mental note to implement count properly in the near future, it's not a big task really.
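A sketch of that index idea, assuming a hypothetical composite key layout (namespace byte | kind | pubkey | ...) written next to each stored event; the count is then a keys-only prefix scan, so no event ever gets decoded. The layout and paths here are illustrative, not orly's actual schema:

```go
package main

import (
	"encoding/binary"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

// indexKey builds a hypothetical composite index key: namespace byte, kind, pubkey.
// A real schema would also append timestamp and event id to keep keys unique.
func indexKey(kind uint16, pubkey []byte) []byte {
	k := []byte{'k'} // index namespace prefix
	k = binary.BigEndian.AppendUint16(k, kind)
	return append(k, pubkey...)
}

// countByPrefix counts index entries under a prefix without touching values,
// so no stored event is ever decoded.
func countByPrefix(db *badger.DB, prefix []byte) (n uint64, err error) {
	err = db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // keys only
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			n++
		}
		return nil
	})
	return n, err
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/nostr-index"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count kind-7 reactions by one (hypothetical) author.
	author := make([]byte, 32)
	n, err := countByPrefix(db, indexKey(7, author))
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("count: %d", n)
}
```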
ah got it. Yeah ofc in a key-value store you have to implement that from scratch. Btw, if you ever need a SQLite package for nostr events, I wrote this: it provides bare-bones tables, but in the constructor you are free to add other schemas or triggers, and you can even replace or modify the default query builder to take advantage of new tables, indices and so on.
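For comparison, in a SQL-backed store the count is a single query. A sketch assuming a simple illustrative schema (an events table plus a tags(event_id, name, value) table; not nastro's actual one):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "events.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count kind-7 reactions that tag a given event, without loading any of them.
	const q = `
		SELECT COUNT(*)
		FROM events e
		JOIN tags t ON t.event_id = e.id
		WHERE e.kind = 7 AND t.name = 'e' AND t.value = ?`

	var n int64
	if err := db.QueryRow(q, "replace-with-a-real-event-id").Scan(&n); err != nil {
		log.Fatal(err)
	}
	log.Printf("likes: %d", n)
}
```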
well, i'm so used to building badger based databases that i have a bunch of libraries for it, to manage building keys and generating indexes. it doesn't take me that much longer to make an embedded, dedicated database with some particular set of tables this way compared to learning yet another new thing to work with antiquated sql query engines. i mean, it's just my thing, building things like this, so yeah. i write the encoders and the database engines and the network schedulers. offloading stuff to inherently slower, less efficient, less scalable general-purpose database engines just doesn't make sense for me, and when you see how fast it returns results the reason is clear. IMO, nostr is a unique enough type of network-connected database protocol, so open ended, that using generic general-purpose parts to build it is a big sacrifice in performance.
If you are interested, you can open a PR to nastro adding your own badger-based DB. I would actually really appreciate it. The prerequisites are just to satisfy the store interface, expose the internals (so people can make queries directly if needed), and allow for customisation. I think it would be quite beneficial to the ecosystem to move away from fiatjaf's eventstore package, which is quite crappy.
yeah, the event store interface is kinda wrong too. it should be sorting the results before sending them, not depending on the insertion time of the record matching its timestamp (this breaks when someone uploads a bunch of old events to a relay). i mean, at least, where sqlite or postgresql or others are used, the results are sorted in the engine, but the badger engine that fiatjaf made does not, and it runs WAY too many threads. create an issue on the repo and tag me as assigned to the issue of adding a badger based DB engine implementation. i'll get to it in the next month. it's not a difficult task: just looking at the interface, creating all the stub implementations of the methods and then putting the right bits into them based on the existing engine i have built. probably a day or so of work.
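A minimal sketch of that sorting contract, i.e. ordering results by the event's own created_at (newest first) rather than by insertion time, using a stand-in Event struct:

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a stripped-down stand-in for a nostr event.
type Event struct {
	ID        string
	CreatedAt int64 // unix seconds, taken from the event itself, not insertion time
}

// sortNewestFirst orders results by created_at descending, with id as a
// tie-breaker so the order is deterministic.
func sortNewestFirst(events []Event) {
	sort.Slice(events, func(i, j int) bool {
		if events[i].CreatedAt != events[j].CreatedAt {
			return events[i].CreatedAt > events[j].CreatedAt
		}
		return events[i].ID < events[j].ID
	})
}

func main() {
	evs := []Event{
		{ID: "b", CreatedAt: 1700000000},
		{ID: "a", CreatedAt: 1800000000}, // an "old" event uploaded late still sorts by its own timestamp
		{ID: "c", CreatedAt: 1800000000},
	}
	sortNewestFirst(evs)
	fmt.Println(evs) // [{a 1800000000} {c 1800000000} {b 1700000000}]
}
```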
> i mean, at least, where sqlite or postgresql or others are used, the results are sorted in the engine, but the badger engine that fiatjaf made does not

oooof, not good.

> and it runs WAY too many threads

typical...

> create an issue on the repo and tag me as assigned to the issue of adding a badger based DB engine implementation

That's great, thank you sir, I'll do it ASAP!
It's definitely doable but can take quite some time and resources, so it might be a fit only for archiving old data. E.g. for 422k likes it would be several hours and up to 1 TB of RAM, which would cost $6-8 per hour. The cost can be justified if the events are quite sensitive, but definitely not for ordinary posts :) We do expect 10-100x perf improvements in the near future though, so it might become viable at some point.
Looking forward to it guys. Maybe it doesn't help, but I'll throw it out there. HyperLogLogs give an estimate of the cardinality of a set, and are one of the most efficient ways to reduce the memory/storage needed for counting unique elements. If you don't know them, they work by keeping track of the maximum number of trailing consecutive 1s (or leading 0s, the convention doesn't matter) in the hashes of the elements. Example:

"test1" --> hash --> 01001111 (4 trailing 1s)
"test2" --> hash --> 10101101 (1 trailing 1)

Maybe there is a ZK version of an HLL, where I can prove that the maximum number of trailing 1s was X, which would allow me to estimate the cardinality of the set.
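A toy version of that intuition in Go, using a single register: hash each element, track the longest run of trailing 1s, and estimate roughly 2^max. A real HyperLogLog splits the hash space into many buckets and averages with bias correction, so this is only an illustration:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
	"math/bits"
)

// trailingOnes counts the run of consecutive 1 bits at the end of the hash,
// the quantity whose maximum the post above keeps track of.
func trailingOnes(h uint64) int {
	return bits.TrailingZeros64(^h) // flipping bits turns trailing 1s into trailing 0s
}

func main() {
	items := []string{"alice", "bob", "carol", "alice", "bob"} // duplicates don't affect the estimate

	maxRun := 0
	for _, it := range items {
		sum := sha256.Sum256([]byte(it))
		h := binary.BigEndian.Uint64(sum[:8])
		if r := trailingOnes(h); r > maxRun {
			maxRun = r
		}
	}

	// If the longest observed run is r, roughly 2^r distinct items were needed to see it;
	// a real HLL refines this with many registers and bias correction.
	fmt.Printf("max run: %d, crude cardinality estimate: %.0f\n", maxRun, math.Exp2(float64(maxRun)))
}
```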
Relay protocol needs updates to support:
- aggregate counts
- content-based search filtering
- requesting events whose author is contained in a list identified by id or a-tag

Clients continue to federate requests and collate responses.