Should I block AI web crawlers on Oddbean?

On oddbean.com I see a *lot* of web crawling traffic from AI bots like GPTBot hoovering up nostr notes, presumably for training purposes. It's probably one of the easiest nostr sites to crawl, since everything is rendered as plain HTML and crawlers don't need to execute JS code to query relays. To avoid wasting bandwidth, I decided to soft-block them with an honour-system robots.txt (see the sketch after this note).

You could argue they're just wasting my resources and won't bring any visitors or benefit the nostr community in any way. On the other hand, they can probably access this data some other way regardless, and maybe the world-at-large gets some modicum of benefit from better AI models (?). Thoughts? #asknostr
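An honour-system soft-block along these lines would cover the crawlers mentioned above; the exact user-agent list here is my own sketch, not necessarily what Oddbean deploys:

```
# robots.txt -- soft-block AI training crawlers (honour system:
# only compliant bots will respect this)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else is still welcome
User-agent: *
Disallow:
```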
Truncating text is complicated.

Today I spent some time fixing some bugs on oddbean.com that I've been putting off for a while. Most just involved uninteresting grunt work, but one is a huge rabbit hole and, if you've never thought about it before, you may be surprised at how deep it goes.

On Oddbean, we only show the first ~100 characters of a nostr note and then cut it off ("truncate" it). This is all well and good, except some titles got an unexpected weird character at the end:

Nostr Advent Calendar 2024 の 11 日目の記事を書きました。 2024年のNostrリレ�…

(The Japanese reads, roughly: "I wrote the day-11 article for Nostr Advent Calendar 2024. The 2024 Nostr rel�…")

Now, I'm no expert on Japanese script, but I'm pretty sure that diamond question mark character is not supposed to be there. What gives?

The answer is that almost all text on the web is encoded in UTF-8, which is a multi-byte Unicode encoding. That means these Japanese characters actually take up 3 bytes each, unlike Latin letters which take up 1. Oddbean was taking the first 100 bytes and cutting the string off there. Unfortunately, that left an incomplete UTF-8 encoded code point, which the browser replaces with a special replacement character (U+FFFD, the diamond question mark).

OK, easy fix, right? Just do substr() on the code points (not the UTF-8 encoding). Sure, but that is quite inefficient, requiring a pass over the data. Fortunately there is a more efficient fix that relies on the fact that UTF-8 is a self-synchronising code: no matter where in a string you jump to, you can always find the nearest code point boundary. So that is what I did (a sketch of the technique follows at the end of this note).

Problem solved, right? Well, that depends on your definition of "solved". Notice that above I've been referring to "code points" instead of characters. In many languages, such as English, we can pretty much get away with treating these as the same. In other scripts this is not the case: sometimes what we think of as a character actually requires multiple code points. For example, the character 'â' can be represented as 'a' followed by a special combining circumflex character. Most common characters such as â *also* have dedicated code points, and which representation is used depends on the Unicode Normal Form. You may also have seen country flags built from two regional-indicator code points, or emoji modifications such as skin tone -- it's the same principle. Cutting in between such code points will cause truncation artifacts.

So rather than "character" (which is an imprecise notion), Unicode refers to Extended Grapheme Clusters, which correspond as closely as possible to what we think of as individual atoms of text. You can read more than you ever wanted to know about this here:

Note that many languages need special consideration when cutting on graphemes (or indeed words, lines, etc). Korean's Hangul script is especially interesting, having been designed rather than having evolved like most writing systems -- in fact it's quite elegant!

So my hack for Oddbean doesn't do all this fancy grapheme truncation, and that's because I know that if I tried I would end up in a seriously deep rabbit hole. I know because I have, and I did! 10 years ago I published the following Perl module:

I'm pretty proud of this yak shave because of the implementation. I was able to adapt the regular expressions from Unicode TR29, compose them with a UTF-8 regular expression, and compile it all with the Ragel state machine compiler. As a result, it can both validate UTF-8 and (correctly!) truncate in a single pass.

If you want (a lot) more Unicode trivia, I also made a presentation on this topic:
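For the curious, here's a minimal C++ sketch (not Oddbean's actual code) of the self-synchronising trick described above: cut at the byte limit, then back up past any UTF-8 continuation bytes (which always match the bit pattern 10xxxxxx) so the cut lands on a code point boundary.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>

// Truncate a UTF-8 string to at most maxBytes, never splitting a code point.
// Relies on UTF-8 being self-synchronising: continuation bytes are always
// 0b10xxxxxx, so from an arbitrary cut we can back up until we reach a byte
// that starts a code point (ASCII 0xxxxxxx or a leading byte 11xxxxxx).
std::string_view truncateUtf8(std::string_view s, size_t maxBytes) {
    if (s.size() <= maxBytes) return s;
    size_t cut = maxBytes;
    // Back up over continuation bytes: at most 3 steps, since a UTF-8
    // code point is at most 4 bytes long.
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0b1100'0000) == 0b1000'0000) {
        cut--;
    }
    return s.substr(0, cut);
}

int main() {
    std::string title = "2024年のNostrリレー"; // multi-byte Japanese text
    std::cout << truncateUtf8(title, 10) << "…\n"; // cuts on a clean boundary, no U+FFFD
}
```

Note that, as discussed above, this only cuts on code point boundaries, not grapheme boundaries -- a flag or skin-tone emoji straddling the limit would still be mangled.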
nostr has no global source of truth, and that is a good thing

Out of interest, I follow the progress of a lot of other projects similar to nostr, and a couple of links surfaced today.

BlueSky has a big "firehose" connection that streams all updates (new posts, reactions, etc) to subscribers. Unsurprisingly, this is difficult to process except on beefy servers with lots of bandwidth. So one proposed solution is to strip out all that pesky cryptography (signatures, merkle tree data, etc):

And over on Farcaster, keeping their hubs in sync is too difficult, so they want to make all posts globally sequenced, like a blockchain. The details are still being worked out, but I think it's safe to assume there will be a privileged global sequencer who decides on this ordering (and possibly on which posts are included at all):

In my opinion, both of these issues are symptoms of an underlying errant philosophy. These projects both want there to be a global source of truth: a single place you can go to guarantee you're seeing all the posts in a thread, from a particular user, etc. On BlueSky that is https://bluesky.app, and on Farcaster that is warpcast.com. Advocates of each project would of course dispute this, pointing out that you could always self-host or somehow avoid depending on their semi-official infrastructure, but the truth is that if you're not on bluesky.app or warpcast.com, you don't exist, and nobody cares that you don't exist.

nostr has eschewed the concept of a global source of truth. You can't necessarily be sure you are seeing everything. Conversations may sometimes get fragmented, posts may disappear, and there may be the occasional bout of confusion and chaos. There is no official or semi-official nostr website, app, or relay, and this is a good thing. It means we are actually building a decentralised protocol, not just acting out decentralisation theatre, or pretending we'll get there eventually and that the ends justify the means.

Back when computers were primitive and professional data-centres didn't exist, it was impossible to build mega-apps like Twitter. Protocols had to be decentralised by default -- there was simply no other way. We can learn a lot by looking back at protocols of yesteryear, like Usenet and IRC, and at still-popular protocols like email and HTTP. None of these assume a global source of truth, and they are stronger and better for it, as is nostr.
I just tagged strfry 1.0.0. Here are some of the highlights:

* negentropy protocol 1: This is the result of a lot of R&D on different syncing protocols, trying to find the best fit for nostr. I'm pretty excited about the result. Negentropy sync has now been allocated NIP-77.
* Better error messages for users and operators.
* Docs have been updated and refreshed.
* Lots of optimisations: better CPU/memory usage, smaller DBs. Export/import has been sped up a lot: 10x faster or more. This should help reduce the pain of DB upgrades (one is required for this release). Instructions on upgrading are available here:

Thanks to everyone who has helped develop/debug/test strfry over the past 2 years, and for all the kind words and encouragement. The nostr community rocks!

We've got a few things in the pipeline for strfry:

* strfry proxy: This will be a new feature for the router that enables intelligent reverse proxying for the nostr protocol. It will help scale up mega-sized relays by allowing the storage and processing workload to be split across multiple independent machines. Various partitioning schemes will be supported, depending on performance and redundancy requirements. The front-end router instances will perform multiple concurrent nostr queries against the backend relays and merge their results into a single stream for the original client.
* As well as scaling up, reverse proxying can also help scale down. By dynamically incorporating relay list settings (NIP-65), nostr queries can be satisfied by proxying requests to external relays on behalf of a client and merging the results together with any matching cached local events. Negentropy will be used where possible to avoid wasting bandwidth on duplicate events.
* Archival mode: Currently strfry stores all events fully indexed in its main DB, along with their full JSON representations (optionally zstd dictionary compressed). For old events that are queried infrequently, space usage can be reduced considerably. As well as deindexing, we are planning to take advantage of columnar storage, aggregation of reaction events, and other tricks. This will play nicely with strfry proxy, and events can gradually migrate to the archival relays.
* Last but not least, our website https://oddbean.com is going to get some love: custom algorithms, search, bugfixes, better relay coverage, and more!
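If you want to try negentropy sync between relays, strfry exposes it through its sync command. The invocation below is from memory and the relay URL is a placeholder, so check strfry's docs for the exact flags:

```
# Pull any kind-1 events we're missing from a remote relay
# (hypothetical relay URL; --filter/--dir as I recall them)
strfry sync wss://relay.example.com --filter '{"kinds":[1]}' --dir down
```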
I just tagged 2 strfry releases: 0.9.7 and 1.0.0-beta1.

1.0.0-beta1 is a candidate release of strfry 1.0.0 -- help is needed with testing! The internal strfry DB version has been increased to 3, which means you will need to rebuild your DBs to use this new version.

0.9.7 has some bugfixes and changes that accumulated in master, and adds a "strfry export --fried" feature that can be used to create DB exports that can be rapidly imported by 1.0.0-series releases.

The full changelogs are available here:

Thank you to everyone who contributed and helped with testing! If you need help or run into any issues, reply on nostr or stop by our Telegram channel.
Exporting and importing events into a new strfry instance (which you need to do when the DB version changes) takes too long. Here's a feature I just added that speeds this up a lot:

Going forward, there is a release-0.9 branch. I'm going to tag one more release on that branch soon (after back-porting a couple of fixes). It will have strfry export --fried, but not the corresponding import. I'm planning on this being the last release of the 0.9 series.

I'm working on a 1.0 release in the next branch. I just did a big refactor of the DB format that I've wanted to do for some time. I also removed prefix matching on id/pubkey (this was removed from NIP-01) and fixed a bunch of bugs. This release will also have the latest negentropy protocol version and BTree code.
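Based on the above, a 0.9-to-1.0 upgrade would look roughly like this. Only export --fried is confirmed for the 0.9 branch; the matching flag on import is my assumption from the description, so verify against the changelog:

```
# On the old 0.9.x instance: dump the DB in the fast "fried" format
strfry export --fried > events.fried

# On the new 1.0 instance (fresh DB): import it.
# Assumption: import accepts the same --fried flag.
strfry import --fried < events.fried
```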