Rusty Russell's avatar
Rusty Russell
rusty@rusty.ozlabs.org
npub179e9...lz4s
Lead Core Lightning, Standards Wrangler, Bitcoin Script Restoration ponderer, coder. Full time employed on Free and Open Source Software since 1998. Joyous hacking with others for over 25 years.
I'm slowly coming around to the following roadmap:
1. Simplified CTV for whole-tx commitments (ie you *will* spend this output using a tx which looks exactly like X).
2. Optimised sponsors for solving the "but how do I add fees" problem in a way that doesn't drive miner centralisation.
3. Script restoration so we don't have arbitrary limits on things like amount arithmetic and examination sizes.
4. Introspection opcode(s) so we can examine txs flexibly.
5. Script enhancements for things like merkle proofs (e.g. Taproot trees), tweaks, and checksig.
You could argue that #1 is simply an optimisation of #3/#4, and that's true, but it's also an obvious case (once you have #2) that we will still want even when we have all the rest.
Some nice bug reports coming in, from real usage. A nice "please submit a bug report" message came in (obviously, a case I thought would never get hit!). So my weekend (remember how I said I wouldn't be working all hours now the release is close? Ha!) has been occupied thinking about this.

On the surface, this happens when we exactly fill the maximum capacity of a local channel, then go to add fees and can't fit (if we hit the htlc max, we split into two htlcs in this case). We should go back and ask our min-cost-flow solver for another route for the part we can't afford. This is almost certain to fail, though, because there was a reason we were trying to jam the entire thing down that one channel.

But what's more interesting is what's actually happening: something I managed to accidentally trigger in CI for *another* test. See, we fail a payment at the time we get the peer's sig on the tx with the HTLC removed. But after that, there's another round trip while we clear the HTLC from the peer's tx. The funds in flight aren't *really* available again until that completes. This matters for xpay, which tends to respond to failure by throwing another payment out. This can fail because the previous one hasn't totally finished (in my test, it wasn't out of capacity, but actually hit the total dust limit, but it's the same effect: gratuitous failure on the local channel). Xpay assumes the previous failure is caused by capacity limits, and reduces the capacity estimate of the local channel (it should know the capacity, but other operations or the peer could change it, so it tries not to assume). Eventually, this capacity estimate becomes exactly the payment we are trying to make, and we hit the "can't add fees" corner case.

There are four ways to fix this:
1. Allow adding a new htlc while the old one is being removed. This seems spec-legal, but in practice would need a lot of interop testing.
2. Don't fail htlcs until they're completely cleared. But the sooner we report failure, the sooner we can start calculating more routes.
3. If a local error happens, wait until htlcs are fully clear and try again.
4. Wait inside "injectpaymentonion" until htlcs are clear.

We're at rc2, so I'm going mid-brain on this: wait for a second and retry if this happens! Polling on channel htlcs is possible, but won't win much for this corner case. Longer term, inject could efficiently retry (it can trigger on the htlc vanishing, as it's inside lightningd). But that's more code and nobody will ever care.
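The mid-brain workaround can be sketched in a few lines. This is a toy illustration, not the real CLN code: `LocalChannelError` and `send_with_retry` are hypothetical names standing in for the actual lightningd internals.

```python
import time

class LocalChannelError(Exception):
    """Stand-in for the 'can't add fees' local-channel failure described above."""

def send_with_retry(inject_payment, wait_s=1.0, max_tries=2):
    """Try the payment; on a local-channel failure, wait briefly (to let the
    previous HTLC finish clearing from the peer's commitment) and retry once."""
    for attempt in range(max_tries):
        try:
            return inject_payment()
        except LocalChannelError:
            if attempt + 1 == max_tries:
                raise  # out of retries: report the failure upward
            time.sleep(wait_s)  # crude: give the in-flight HTLC time to clear
```

The longer-term fix the post mentions, triggering on the HTLC actually vanishing instead of sleeping, would replace the `time.sleep` with an event wait inside lightningd.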
It is important to empathize with frustrated users. It's sometimes an unattainable ideal, but who hasn't hit software that Just Doesn't Work? We don't really care if it's just something about our setup, or fundamentally broken, or a completely unhelpful error message: it's an incredibly frustrating feeling of impotence. Sure, you shouldn't take it out on the devs you aren't paying, but we're all human. I can't speak for all developers, but I became a FOSS coder in the Linux Kernel. That gave me a pretty thick skin: Linus could be an ass, and even when he was wrong there was no appeal. So I generally find it easier to sift through the users' frustrations and try to get to the problem they are having. And often it turns out, I agree! This shit should just Work Better!

CLN payments are the example here, and they were never my priority. That might seem weird, but the first production CLN node was the Blockstream store. So we're good at *receiving* payments! But the method of routing and actually making payments is neither spec-defined nor a way to lose money. It's also hard to measure success properly, since it depends on the vagaries of the network at the time. But it's important, it turns out :). And now we see it first-hand, since we host nodes at Greenlight. So this release, unlike most, was "get a new pay system in place" (hence we will miss our release date, for the first time since we switched to date-based releases).

Here's a list of what we did:
1. I was Release Captain. I was next in the rotation anyway, but since this was going to be a weird release I wanted to take responsibility.
2. I wrote a compressor for the current topology snapshot. This lets us check a "known" realistic data set into the repo for CI.
3. I wrote a fake channel daemon, which uses the decompressed topology to simulate the entire network.
4. I pulled the min-cost-flow solver out of renepay into its own general plugin, "askrene". This lets anyone access it, lets @lagrange further enhance it, and makes it easier for custom pay plugins to exist: Michael of Boltz showed how important this is with mpay.
5. A new interface for sending HTLCs, which mirrors the path of payments coming from other nodes. In particular, this handles self-pay (including payments where part is self-pay and part remote!) and blinded path entry natively, just like any other payment.
6. Enhancements and cleanups to our "libplugin" library for built-in plugins, to avoid nasty hacks pay has to do.
7. Finally, a new "xpay" command and plugin. After all the other work, this was fairly simple. In particular, I chose not to be bound to the current pay API, which is a bit painful in the short term.
8. @Alex changed our gossip code to be more aggressive: you can't route if you can't see the network well!

Importantly, I haven't closed this issue: we need to see how this works in the Real World! Engineers always love rewriting, but it can actually make things worse as lessons are lost, and workarounds people were using before stop being effective. But after this fairly Herculean effort, I'm going to need to switch to other things for a while. There are always other things to work on!
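To give a feel for the min-cost-flow idea behind renepay/askrene, here is a textbook successive-shortest-paths solver on a toy channel graph. This is purely illustrative (askrene's actual solver and cost model are far more sophisticated); edges are hypothetical `(from, to, capacity, fee_per_unit)` tuples.

```python
from collections import defaultdict

def min_cost_flow(n, edges, src, dst, amount):
    """Route `amount` from src to dst over n nodes, repeatedly pushing flow
    along the cheapest residual path.  Returns the total fee paid."""
    graph = defaultdict(list)          # node -> indices of outgoing edges
    cap, cost, to = [], [], []
    def add(u, v, c, w):
        # Forward edge at an even index; reverse twin (cost -w) at index+1.
        graph[u].append(len(cap)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(cap)); to.append(u); cap.append(0); cost.append(-w)
    for u, v, c, w in edges:
        add(u, v, c, w)
    total_fee = 0
    while amount > 0:
        # Bellman-Ford: cheapest path in the residual graph (handles -w edges).
        dist = [float("inf")] * n
        prev_edge = [-1] * n
        dist[src] = 0
        for _ in range(n - 1):
            for u in range(n):
                if dist[u] == float("inf"):
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]] = dist[u] + cost[e]
                        prev_edge[to[e]] = e
        if dist[dst] == float("inf"):
            raise ValueError("not enough capacity")
        # Push as much as fits along that cheapest path.
        push, v = amount, dst
        while v != src:
            e = prev_edge[v]
            push = min(push, cap[e])
            v = to[e ^ 1]              # e^1 is the reverse twin, pointing back
        v = dst
        while v != src:
            e = prev_edge[v]
            cap[e] -= push
            cap[e ^ 1] += push
            v = to[e ^ 1]
        total_fee += push * dist[dst]
        amount -= push
    return total_fee
```

The key property, and why a flow solver beats single-path routing for large payments, is that it naturally splits the amount across parallel paths when one channel's capacity runs out.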
Writing release notes is fun, but the part I really like in the release process is preparing the first commits for the *next* release:
1. BOLT spec updates. We check all the BOLT quotes in our source, and have a script to update the spec version one commit at a time. This is a grab bag of typo fixes, feature merges (which may mean we no longer need our local patches), and occasionally major changes. It's unpredictable enough that I enjoy it.
2. Removing long-deprecated features. We now give a year, then you can enable each deprecated feature individually with a configuration flag, then (if we haven't heard complaints!) we finally remove it. This means removing code (usually ugly shim code) and is a genuine joy. I've started this for 25.02, and it's a balm after the release grind...
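The per-feature override in step 2 looks roughly like this in a lightningd config file. Flag names here are from memory and the feature name is a placeholder; check your CLN version's `lightningd-config` documentation before relying on them.

```shell
# lightningd config sketch -- names are assumptions, verify against your docs.
# Re-enable one specific deprecated feature past its normal lifetime:
i-promise-to-fix-broken-api-user=deprecated_feature_name
# The older blanket switch for all deprecated APIs:
allow-deprecated-apis=true
```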
The problem with CTV is fees. When you look at most designs using CTV, they need *another* tx, and an anchor output, so they can pay fees. What they really want is "this tx, plus an input and optional change output". People tend to ignore fees in their protocol design. But in implementation they're critical, and only getting more so. Lightning has been down this path!
Trying to do everything, but mainly doing a little of everything badly. #CLN release is late, I've been ignoring the BOLTs changes, I haven't even looked at GSR since my OP_NEXT talk, and my comprehensive list of opcodes for introspection is still stuck in my head. I try to take time to post on #nostr over morning coffee though, since doing that every day helps move things forward.
To recycle an old joke: "Bitcoiners' self-worth reaches 90k." The price is like the weather. People can discuss it endlessly, but unless you're going out into it, it's trivia. Don't be the price. It's boring.
After a long day going through all 44 issues and pull requests for the milestone, and lots of hard work, I've got it down to 49. This could take a while...
Here are the very draft interpreter and BIPs for Script Restoration. I'm seeking comments (looking at @Rijndael and @Steven Roose here!) on all aspects, including format and what opcodes make sense to combine into a single proposal. There are definitely some tweaks we could make to costing (it bugs me that the formula for OP_MUL is not symmetric, for example!), but that part is pretty solid. Once we have a single proposal that we think is well-specified, reasonable and useful, it'll be time to get coauthors (I have a day job, and it's not this!) and formally propose it for broader scrutiny. And the interpreter (not the other parts, so it's not actually usable except for bench and test!):
Really nice, thoughtful panel at OP_NEXT with @Peter Todd and Jay Beddict. Good questions from Pete Rizzo; felt like we could have gone much, much longer...
Woke at 6am thinking about Bitcoin. Couldn't get back to sleep, so I finished the first pass of my budget estimation util. You give it a script, and it gives you a maximum budget it could use, or a maximum input element size it could handle. It's approximate (it doesn't actually evaluate properly), but it's an upper bound. I need to come up with some good examples though...
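The "approximate but an upper bound" idea can be sketched as a static pass that never evaluates anything, just sums worst-case costs per opcode. The opcodes and cost numbers below are made up for illustration; the real costings live in the draft Script Restoration BIPs.

```python
# Hypothetical worst-case cost per opcode (illustrative numbers only).
WORST_CASE_COST = {
    "OP_ADD": 1,
    "OP_MUL": 16,      # multiplication costed higher than addition
    "OP_CAT": 4,
    "OP_SHA256": 64,
    "OP_CHECKSIG": 50,
}

def max_budget(script):
    """Static upper bound on the budget a script could consume: sum the
    worst-case cost of every opcode, without executing the script.
    Unknown opcodes get a floor cost of 1, keeping this a true upper
    bound only for the opcodes we model."""
    return sum(WORST_CASE_COST.get(op, 1) for op in script)
```

Because it never runs the script, this over-counts branches that can't both execute, which is exactly the "approximate, but an upper bound" trade-off the post describes.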