Here's a left-side-of-the-bell-curve way to do the Internet Archive "right":
- Create browser extension
- User loads page
- User clicks "archive" button
- Whatever is in user's browser gets signed & published to relays
- Archival event contains URL, timestamp, etc.
- Do OpenTimestamps attestation via NIP-03
- ???
- Profit

I'm sure there are a hundred details I'm glossing over, but because this is user-driven and does all the archiving "on the edge" it would just work, not only in theory but very much so in practice. The reason the Internet Archive can be blocked is that it is a central thing: when users make an archival request they don't do the archiving themselves, they send the request to a central server that does the archiving. And that central server can be blocked.
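A minimal sketch of what the "sign & publish" step could look like inside such an extension, assuming a placeholder event kind, a NIP-07 signer for the actual signing, and the browser's Web Crypto API; the buildArchiveEvent helper and all tag names except "u" are illustrative, not from any ratified NIP.

    // Hash the captured HTML so the event commits to the exact bytes archived.
    async function sha256Hex(text: string): Promise<string> {
      const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
      return Array.from(new Uint8Array(digest)).map((b) => b.toString(16).padStart(2, "0")).join("");
    }

    // Build the unsigned archival event template (kind 1234 is a placeholder).
    async function buildArchiveEvent(url: string, html: string) {
      return {
        kind: 1234,                     // placeholder kind, no NIP claimed
        created_at: Math.floor(Date.now() / 1000),
        tags: [
          ["u", url],                   // archived URL
          ["x", await sha256Hex(html)], // SHA-256 of the captured HTML
        ],
        content: html,                  // or a pointer (Blossom hash, magnet link) instead of inline HTML
      };
    }

    // A NIP-07 signer (window.nostr.signEvent) signs the template before it is
    // published to relays; a kind 1040 event (NIP-03) can then carry the
    // OpenTimestamps proof for the signed event's id.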

Replies (101)

Gigi
View quoted note →
Yes, it needs to be stored in a decentralized way. Maybe torrents could also be used to make it more resilient?
waxwing 4 months ago
TLSNotary-style proofs are kind of needed imo.
HoloKat 4 months ago
Maybe a way to zap the most active archivers
I'm not saying this isn't an issue with the Internet Archive, but would it be easy to spoof page contents? If so, it would be neat to allow other npubs to sign and verify that the content is accurate.
I'm sure you're not glossing over any details. Academics might disagree, but they're dead inside.
I fear this would lead to soooo much personal information being doxxed by accident. I would never risk clicking such a button. Maybe we can make sure the extension scans for PII first, but still…
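A rough sketch of such a pre-publish scan, assuming a purely client-side regex pass; the patterns and the scanForPII name are illustrative, and a real check would need far more than this.

    // Hypothetical pre-publish check: flag obvious personal identifiers in the
    // captured HTML and warn the user before anything is signed.
    const PII_PATTERNS: Record<string, RegExp> = {
      email: /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/i,
      iban: /\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b/,
      phone: /\+?\d[\d\s().-]{7,}\d/,
    };

    function scanForPII(html: string): string[] {
      const findings: string[] = [];
      for (const [label, pattern] of Object.entries(PII_PATTERNS)) {
        if (pattern.test(html)) findings.push(label);
      }
      return findings; // e.g. ["email", "phone"] -> ask the user before publishing
    }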
The NIP on attestations that @Nathan Day developed addresses this issue. Originally for proof of place, it has since been generalized. Ultimately you have to trust the attestor.
Owner_of_donky 4 months ago
The issue with user-driven archiving is that not all users have good intentions. One could try, and would probably succeed, in uploading malicious "copies" of a website.
Karadenizli 4 months ago
Well yes, that's their whole service, though. They are used because people trust them. We need to provide at least as much confidence as they do.
noname 4 months ago
Web archiving is child's play. What we want to do is decentralize web crawling data. Nobody uses YaCy, which implies it failed. What is the total size of the latest Common Crawl? Estimate: compressed size (gzipped WARC) is 250-350 TB. Solve this.
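Back-of-the-envelope math for what "solve this" means at the edge; every number below is an assumption, not a measurement.

    // Rough capacity math: a ~300 TB crawl, replicated 3x for resilience,
    // split across volunteers pledging 100 GB each (all numbers assumed).
    const crawlTB = 300;
    const replicationFactor = 3;
    const pledgePerVolunteerGB = 100;
    const volunteersNeeded = (crawlTB * 1000 * replicationFactor) / pledgePerVolunteerGB;
    console.log(volunteersNeeded); // 9000 volunteers at 100 GB each cover one replicated crawl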
Scott 4 months ago
I would view one person archiving a webpage differently than many people archiving that same webpage.
There's no reason services couldn't coexist alongside or within a network of individuals. Said services could even recruit contributors from the wider network, or make a service out of "verifying" and storing network events that they deem "accurate" on their own relays. The two things would help create accountability for both.
But web pages are often served differently to different visitors. There can be location/personalization variations of exactly the same page, even for different logged-out, "anonymous" users. Also, a "page" is more than a single file: there are usually dozens of associated files like CSS, JS, and images. Just saving the HTML won't get you very far when you want to look at it again later.
That is a very interesting suggestion. E.g. the hash of the archived page should match the hash of the original page.
You could normalize some params like browser engine, user agent, JavaScript on/off, timezone, language, etc. It probably won't match exactly, as you likely have trackers and different scripts.
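A sketch of that kind of normalization, assuming you are willing to compare only the visible text and discard scripts, styles, and whitespace; normalizedTextHash is an illustrative name, not an existing API.

    // Strip the most volatile parts of a capture and hash only the readable
    // text, so two visitors' captures of the "same" article can be compared.
    async function normalizedTextHash(html: string): Promise<string> {
      const doc = new DOMParser().parseFromString(html, "text/html");
      doc.querySelectorAll("script, style, noscript, iframe").forEach((el) => el.remove());
      const text = (doc.body?.textContent ?? "").replace(/\s+/g, " ").trim().toLowerCase();
      const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
      return Array.from(new Uint8Array(digest)).map((b) => b.toString(16).padStart(2, "0")).join("");
    }
    // Matching hashes mean the visible text agrees, even if trackers, ads, or
    // inline scripts differ between the two captures.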
Whenever the HTML that's rendered in your browser contains some personal information (e.g. an email, your legal name, whatever), it would be included in the archived page and signed by you. If you are not really, really careful about what the extension includes in the page, you could leak information that you don't want to share. The same goes for stuff that might not even be visible to you. Imagine a newspaper that has a profile modal for logged-in users. The modal is part of the HTML, but hidden via CSS until you open it. HTML scrapers would still include all the data that is part of the hidden modal, without it ever being visible during the user's visit.
? 4 months ago
Interesting idea. Would it be enough to take a screenshot of the website, hash it, and timestamp it?
It records a session, so a whole website if you click through everything. IMO recording the session is better than trying to crawl the whole site, since it captures exactly what you're interested in and doesn't get confused by the way websites are built nowadays.
Oh! That's a great point. One of the aforementioned 100 things I didn't think about. Solvable issue, but still an issue.
Karadenizli 4 months ago
But at that point, that service needs to render and download the page. Then they need to compare it to the page downloaded by the user. You can't use a comparison operator or anything, because on most pages it will vary based on the user and device. You would need manual review and judgement calls by the reviewer, who will have to check word by word to see if anything is removed or added; then, for display, they will have to decide if differences in presentation are valid or not. Basically, checking a page uploaded by another user is far more laborious and complex than just capturing it yourself.

The far more likely scenario if this is implemented is that there will not be a separate service verifying pages uploaded by global users; those services will be the ones capturing and uploading the sites. So the user will still just hand these guys a link for them to screenshot, but there will be multiple competing services that the user can choose from, and they will all be interoperable from multiple clients. While any npub can post a capture, all the clients will curate the captures based on the trustworthiness of the npub, and your captures won't be used for anything but curious people looking up shit for reference. Most clients will work on a fully whitelist basis, only showing captures from a few selected npubs and banning them at the first sign of forgery. This is still better than the Internet Archive, but you will never see crowdsourced web captures that are worth shit to anyone.
This is an urgent case as we live in the last days of "truth". Everything is being manipulated and erased in REAL TIME. Decentralized Internet Archive! Love the idea. LFG!
Dan 4 months ago
I'm very left side. Can you just get a bot to auto-archive everything?
Consider using kind 31. We put some thought into the tags, to meet academic citation standards.

{
  "kind": 31,
  "pubkey": "<citation-writer-pubkey>",
  "tags": [
    // mandatory tags
    ["u", "<URL where citation was accessed>"],
    ["accessed_on", "<date-time in ISO 8601 format>"],
    ["title", "<title to display for citation>"],
    ["author", "<author to display for citation>"],
    // additional, optional tags
    ["published_on", "<date-time in ISO 8601 format>"],
    ["published_by", "<who published the citation>"],
    ["version", "<version or edition of the publication>"],
    ["location", "<where was it written or published>"],
    ["g", "<geohash of the precise location>"],
    ["open_timestamp", "<`e` tag of kind 1040 event>"],
    ["summary", "<short explanation of which topics the citation covers>"]
  ],
  "content": "<text cited>"
}
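For illustration, a sketch of filling that template from a capture; the values and the buildCitation helper are made up, and signing/publishing is left to whatever signer the client uses.

    // Assemble a kind-31 citation template as proposed above; signing and
    // publishing are left to the client's signer.
    function buildCitation(url: string, title: string, author: string, citedText: string) {
      return {
        kind: 31,
        created_at: Math.floor(Date.now() / 1000),
        tags: [
          ["u", url],
          ["accessed_on", new Date().toISOString()],
          ["title", title],
          ["author", author],
        ],
        content: citedText,
      };
    }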
A place where media presented on the nostr protocol is stored, compressed to be opened by any client. When you upload a video or image via a nostr app you can choose a Blossom server that stores it for you. It can be one that you run yourself forever; this way you own your data. You don't want to store everything on the relays, for many reasons...
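A sketch of an upload to such a server, based on my reading of the Blossom spec (raw bytes PUT to /upload, authorized by a signed Nostr event); double-check the spec before relying on the endpoint or header format.

    // Push a captured blob to a Blossom server: raw bytes are PUT to /upload,
    // addressed by their SHA-256, with a signed Nostr auth event in the header.
    async function uploadToBlossom(server: string, blob: Blob, signedAuthEvent: object): Promise<string> {
      const bytes = await blob.arrayBuffer();
      const digest = await crypto.subtle.digest("SHA-256", bytes);
      const sha256 = Array.from(new Uint8Array(digest)).map((b) => b.toString(16).padStart(2, "0")).join("");
      const res = await fetch(`${server}/upload`, {
        method: "PUT",
        body: bytes,
        headers: { Authorization: `Nostr ${btoa(JSON.stringify(signedAuthEvent))}` },
      });
      if (!res.ok) throw new Error(`upload failed: ${res.status}`);
      return sha256; // the blob should later be retrievable as <server>/<sha256>
    }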
SingleFile can even be set up to push things to an arbitrary API. So it should be possible to jerry-rig this quite quickly with some spit and some duct tape.
Very interesting. Thank you. Do you have a good link for beginners to run your own relay?
Yeah, I don't want to click links and deal with paywalls, ads, and trackers. Instead I would like to see a long-form note containing the archived page. Notes that contain archived pages can also benefit from Disqus-style comments.
Drinnie 4 months ago
"Whatever is in user's browser gets signed & published to relays ". This is the problem. For paywalled content, how can we be sure that there is no beacon stored somewhere in the page (DOM, js, html) that identifies the subscriber?
the axiom 4 months ago
Duct tape is stupid and short-sighted, you need a decent standard.
I hate to bring the bad news, but without some added mechanism this would only work for truly static pages (i.e. those where the same .html file is served over and over again), whereas most of the content served is tainted with server-generated "fluff", which could range from some ads on the sidebar to the very text of an article being changed. My point is: even if we both visit the same "page" at the same time, it's more likely than not that we'll get served different versions, even if they differ only by a few lines of code and would look identical as far as the actual content is concerned.
Let's focus on regular content and cross that bridge if we get there. The main issue is that the big services are centralized and the self-hosted stuff isn't syndicated.
i would have for sure read that years ago, but a great reminder ty gigi. i knew they were trying to use this on music but i didn't realise they were embedding gotcha code so they can police how the user uses their computer and, heaven forbid, copies something. fucking hilarious. hard to imagine why they're dying such a quick death 😂 they're suiciding themselves, making their product shit all because they can't come to terms with the characteristics of water. i guess we should thank them
I don't think the goal has ever been to make data impossible to copy. The goal is most likely to make copying certain data more difficult. DRM has done that, whether you like it or not. The industry wouldn't do it if it didn't work to some degree.
ihsotas 4 months ago
That's a good thought. I have an extension I'm working on that bridges the web over to nostr, allowing users to create discussions anywhere on the web using nostr. It seems like an archive function would be a solid addition. If I can get the universal grill box idea solid I will work on the archival concept as well.
ihsotas 4 months ago
It's signed by the user, and reputation becomes king.
ihsotas 4 months ago
You can't, but the reputation of the archivist will come into play. You could also have multiple archives, and ultimately there would be a consensus.
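A toy illustration of that consensus idea, assuming captures have already been reduced to a normalized content hash per URL; the types and function name are made up.

    // Toy consensus: for one URL, count how many distinct pubkeys published
    // each normalized content hash and surface the best-attested capture.
    type Archive = { pubkey: string; url: string; contentHash: string };

    function majorityCapture(archives: Archive[]): { contentHash: string; attestations: number } | null {
      const byHash = new Map<string, Set<string>>();
      for (const a of archives) {
        if (!byHash.has(a.contentHash)) byHash.set(a.contentHash, new Set());
        byHash.get(a.contentHash)!.add(a.pubkey);
      }
      let best: { contentHash: string; attestations: number } | null = null;
      for (const [contentHash, pubkeys] of byHash) {
        if (!best || pubkeys.size > best.attestations) best = { contentHash, attestations: pubkeys.size };
      }
      return best; // weighting by reputation/web-of-trust would replace the raw count
    }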
All the JavaScript getting ingested? Worried about the privacy part but very interesting.
I've been casually vibe coding this since Wednesday. I think it's quite a powerful idea. I have zero experience with making an extension, but it's the first time AI called a project "seriously impressive" when I threw Gigi's idea in there. So far I have come up with a few additional features, but the spec would be this at a minimum:
- OTS via NIP-03
- Blossom for media
- 3 different archiving modes:
  - Forensic Mode: clean server fetch, zero browser involvement = no tampering
  - Verified Mode: dual capture (server + local) + automatic comparison = manipulation detection
  - Personal Mode: exact browser view including logged-in content = your evidence
Still debugging the Blossom integration, and NIP-07 signing for extensions seems tricky. The only caveat is that you would need a proxy to run Verified and Forensic modes, as CORS will block the requests otherwise. Not sure how that would be handled other than hosting a proxy. Once I have a somewhat working version I may just throw all the source code out there, I dunno. Some test archives I've done on a burner account using this custom Nostr archive explorer here.
View quoted note →
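A sketch of the Verified Mode comparison described above, assuming a CORS proxy you host yourself; the proxy URL format and the normalize step are placeholders.

    // Verified Mode sketch: fetch a clean copy through a self-hosted proxy
    // (CORS blocks direct cross-origin fetches from an extension's page context)
    // and compare it with the local capture after crude normalization.
    async function verifiedCapture(url: string, localHtml: string, proxyBase: string) {
      const res = await fetch(`${proxyBase}?url=${encodeURIComponent(url)}`);
      const serverHtml = await res.text();
      const normalize = (html: string) =>
        html.replace(/<script[\s\S]*?<\/script>/gi, "").replace(/\s+/g, " ").trim();
      const match = normalize(serverHtml) === normalize(localHtml);
      return { match, serverHtml }; // a mismatch flags possible tampering or per-visitor variation
    }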
You say this is left-side, but there is nothing on the right side of the curve, since what you describe here is already at maximum complexity. And that archiver extension is a mess. But sure, it's a good idea, so it must be done.
I made this extension: https://github.com/fiatjaf/nostr-web-archiver/releases/tag/whatever, which is heavily modified from that other one. Damn, this "Lit" framework for making webgarbage is truly horrible, and this codebase is a mess worse than mine, but I'm glad they have the dirty parts of actually archiving the pages working pretty well. Then there is for browsing archives from others. Please someone test this. If I have to test it again myself I'll cry. I must wait some days now to see if Google approves this extension on their store, meanwhile you can install it manually from the link above.
It works. I'm not sure how to view my own, but my Amber log shows what I think are all the right activities. I'm not sure what the crying is about. This extension is more cooperative than the scrobbler one.