This is one of the most important developments in nostr. We're getting rsync for nostr queries. Incredible work, thank you Doug! Time to start implementing this in nostrdb.
npub1yxpr...qud4
Hi all! I have just pushed a major update to the negentropy project. It implements protocol version 1 (the previous version was 0). The protocol has been simplified: the ID size is no longer configurable, there are no more continuation ranges, and the output can be constructed in the same input-scan pass.

There is a new fingerprint function, based on an incremental hash. This allows fingerprints to be pre-computed and stored in a tree structure. The C++ implementation includes a B+Tree implementation that allows fingerprints to be computed without collecting IDs from the entire DB.

I have written a comprehensive article that goes over the theory of RBSR and the negentropy implementation: https://logperiodic.com/rbsr.html Comments are appreciated!

Finally, I have integrated the new version of negentropy and the B+Tree with strfry. It's in a development branch and not quite ready for production, but my testing indicates this will be a massive improvement for full-DB syncs, especially on relays with very large DBs. In-sync or nearly in-sync relays should sync almost instantly with negligible resource usage. Relays will also use the B+Tree for filters that contain only until and/or since, meaning that date-range full-DB syncs are also efficient. Syncing arbitrary filters works as before (but I have begun work to make these more efficient as well).

Unfortunately, this is a breaking change for the negentropy protocol (this really should be the last one!) and it will also require a new DB version. I'm going to take this opportunity to make a few more breaking DB changes, and I plan to release strfry 1.0 after a beta testing period.
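To illustrate the incremental-hash idea behind the fingerprint function, here is a minimal Python sketch. It assumes a sum-then-hash construction (IDs added as 256-bit integers modulo 2^256, then hashed down to 16 bytes), which is my reading of the RBSR article rather than negentropy's exact wire-level definition:

```python
import hashlib

MOD = 2 ** 256

def combine(acc: int, event_id: bytes) -> int:
    # Addition mod 2^256 is commutative and invertible, so per-subtree
    # accumulators can be merged, and an ID removed by subtracting.
    return (acc + int.from_bytes(event_id, "big")) % MOD

def fingerprint(acc: int, count: int) -> bytes:
    # Hash the accumulator (plus element count) down to a short fingerprint.
    data = acc.to_bytes(32, "big") + count.to_bytes(8, "big")
    return hashlib.sha256(data).digest()[:16]

ids = [hashlib.sha256(bytes([i])).digest() for i in range(5)]
acc = 0
for i in ids:
    acc = combine(acc, i)
print(fingerprint(acc, len(ids)).hex())

# Dropping an ID is just modular subtraction: no rescan of the range,
# which is what makes storing fingerprints in a B+Tree practical.
acc = (acc - int.from_bytes(ids[0], "big")) % MOD
print(fingerprint(acc, len(ids) - 1).hex())
```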

Replies (18)

Relay A has some set of notes; relay B has another set of notes. How do you sync these relays efficiently? Also, from a client perspective: how do I query and download only the notes I don't already have? No longer do we need to download tons of redundant data.
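For intuition, here is a toy sketch of range-based set reconciliation: fingerprint a range on both sides, skip it entirely if the fingerprints match, otherwise split and recurse, falling back to exchanging raw IDs for small ranges. The fingerprint and function names here are illustrative, not negentropy's actual protocol:

```python
import hashlib

def fingerprint(ids):
    # Illustrative stand-in: hash of the sorted IDs in a range.
    h = hashlib.sha256()
    for i in sorted(ids):
        h.update(i.to_bytes(32, "big"))
    return h.digest()[:16]

def reconcile(ours, theirs, lo, hi, threshold=4):
    """Return (we_need, they_need) for IDs in the half-open range [lo, hi)."""
    a = {i for i in ours if lo <= i < hi}
    b = {i for i in theirs if lo <= i < hi}
    if fingerprint(a) == fingerprint(b):
        return set(), set()            # range already in sync: skip it
    if len(a) <= threshold or len(b) <= threshold:
        return b - a, a - b            # small range: exchange raw IDs
    mid = (lo + hi) // 2               # otherwise split and recurse
    l = reconcile(ours, theirs, lo, mid, threshold)
    r = reconcile(ours, theirs, mid, hi, threshold)
    return l[0] | r[0], l[1] | r[1]

ours = {1, 2, 3, 5, 8}
theirs = {1, 2, 4, 5, 9}
print(reconcile(ours, theirs, 0, 2 ** 32))  # ({4, 9}, {3, 8})
```

When the sets are identical or nearly identical, almost every range's fingerprint matches, which is why in-sync relays can reconcile almost instantly.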
You wouldn't even bother. Randomly sample a smaller set and average it for a quick answer that's good enough for the accuracy you want at the moment. If you want to know more exactly, take a larger sample, based on how much time and resources you have. No?
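As a purely hypothetical illustration of that sampling idea: estimate how much of another relay's set you already hold by checking a random sample instead of the whole thing.

```python
import random

def estimate_overlap(ours, theirs, sample_size=100):
    """Estimate the fraction of `theirs` we already hold via random sampling."""
    sample = random.sample(sorted(theirs), min(sample_size, len(theirs)))
    hits = sum(1 for i in sample if i in ours)
    return hits / len(sample)

ours = set(range(0, 10_000))
theirs = set(range(5_000, 15_000))
print(round(estimate_overlap(ours, theirs), 2))  # roughly 0.5
```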
Why not? That's all they will ever be anyway. Everything is an estimate except the information itself, which in this case you can never have all in one place simultaneously anyway, due to the rapidly growing decentralization. Think on it, or don't; just my 2 sats.
It's not capable of guaranteeing complete synchronisation of things. I'm even working on (in my spare time) a filter plugin that quickly recognises whether a reply references a whitelisted identity, by building filters out of the posts made by the user, for the purpose of caching first-level replies to posts so the relay has them immediately. I'm going to be using bloom filters to build the recogniser. I'm not sure how the state is going to be stored, but probably as replaceable private events for the filter itself, under a virtual user.
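For reference, a minimal bloom-filter sketch (hypothetical, not the plugin's actual code): each item sets k bit positions, giving fast membership tests with false positives but no false negatives.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from one SHA-256 digest of the item.
        digest = hashlib.sha256(item).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i : 4 * i + 4], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add(b"npub1whitelisted")
print(b"npub1whitelisted" in bf)  # True (no false negatives)
print(b"npub1random" in bf)       # almost certainly False
```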
Almost forgot my second point: compact, addressable merkle trees let you guarantee full set-membership discovery, assuming the user didn't post to a relay not known by the one trying to find it. But it can know the ID, and it can know the set of a given batch of them that has been bundled into a tree.
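A minimal merkle-root sketch for intuition (hypothetical, not tied to any nostr spec): leaf hashes of event IDs are pairwise hashed up to a single root that commits to the whole batch.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(ids: list[bytes]) -> bytes:
    """Hash a batch of IDs up to a single root (duplicating a lone tail)."""
    level = [sha256(i) for i in ids]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])        # pad odd-sized levels
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

batch = [b"event-id-1", b"event-id-2", b"event-id-3"]
print(merkle_root(batch).hex())
```

Anyone holding the root can then verify that a given ID belongs to the batch from a logarithmic-size proof, which is what makes the membership guarantee cheap to check.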
I use it regularly. It's an extension of cp/scp that does lots of stuff like scanning the directory tree and reading file sizes and datestamps; I'm not sure, but I don't think it includes generating hashes. IMO, filesystems need several advancements, but we are still stuck with late-70s unix filesystem designs. A few technologies work with the base and take it further, but sadly not enough. A number of things that people associate with Macintosh relate to this: Apple put a lot of effort into making better filesystems, but sadly their user-oriented advances have not been widely adopted. It seems to me, for example, that all files should have a hash in their record that can be quickly accessed. But no. At best, such systems only create mirror images of the filesystem inside a database; no filesystem exists, afaik, that keeps a current hash of the file contents.
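As a toy illustration of the per-file hash idea (hypothetical; ordinary filesystems don't maintain this for you): walk a tree and compute a content hash per file, the kind of record the filesystem could in principle keep up to date itself.

```python
import hashlib
from pathlib import Path

def hash_tree(root: str) -> dict[str, str]:
    """Walk a directory tree and compute a SHA-256 content hash per file."""
    hashes = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            hashes[str(path)] = h.hexdigest()
    return hashes

# Two hosts could diff these maps and transfer only mismatched files.
# print(hash_tree("/tmp/example"))
```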
I should also add that there has been quite a lot of rigidity here. A lot of "filetypes" actually have filesystems inside them; hell, HTML files encode a tree, and hypertext links are like filesystem links across the network. The technology is still very immature. The benefits of extending the filesystem have yet to be realised, and I'm already 40 years in on this, from my early experiences with simple tape-recorder sound-modulated recording systems; better ways have existed for at least 30 years but just have not been implemented. There aren't enough computer programmers, and partly this is because programming languages are overly complex and build systems are unfriendly and labor-intensive.