đź“… Original date posted:2022-10-20
đź“ť Original message:
Hi Joost,
I need more time to review your proposed change, but I wanted to quickly
correct a misunderstanding you had in quoting eclair's code:
> Unfortunately it is possible for nodes on the route to hide themselves.
> If they return random data as the failure message, the sender won't know
> where the failure happened. Some senders then penalize all nodes that
> were part of the route [4][5]. This may exclude perfectly reliable nodes
> from being used for future payments.
Eclair's code does not penalize nodes for future payment attempts in this
case. It only ignores them for the retries of that particular payment.
Cheers,
Bastien
Le mer. 19 oct. 2022 Ă 13:13, Joost Jager <joost.jager at gmail.com> a Ă©crit :
> Hi list,
>
> I wanted to get back to a long-standing issue in Lightning: gaps in error
> attribution. I've posted about this before back in 2019 [1].
>
> Error attribution is important to properly penalize nodes after a payment
> failure occurs. The goal of the penalty is to give the next attempt a
> better chance at succeeding. In the happy failure flow, the sender is able
> to determine the origin of the failure and penalizes a single node or pair
> of nodes.
>
> Unfortunately it is possible for nodes on the route to hide themselves. If
> they return random data as the failure message, the sender won't know where
> the failure happened. Some senders then penalize all nodes that were part
> of the route [4][5]. This may exclude perfectly reliable nodes from being
> used for future payments. Other senders penalize no nodes at all [6][7],
> which allows the offending node to keep the disruption going.
>
> A special case of this is a final node sending back random data. Senders
> that penalize all nodes will keep looking for alternative routes. But
> because each alternative route still ends with that same final node, the
> sender will ultimately penalize all of its peers and possibly a lot of the
> rest of the network too.
>
> I can think of various reasons for exploiting this weakness. One is just
> plain grievance for whatever reason. Another one is to attract more traffic
> by getting competing routing nodes penalized. Or the goal could be to
> sufficiently mess up reputation tracking of a specific sender node to make
> it hard for that node to make further payments.
>
> Related to this are delays in the path. A node can delay propagating back
> a failure message and the sender won't be able to determine which node did
> it.
>
> The link at the top of this post [1] describes a way to address both
> unreadable failure messages as well as delays by letting each node on the
> route append a timestamp and hmac to the failure message. The great
> challenge is to do this in such a way that nodes don’t learn their position
> in the path.
>
> I'm revisiting this idea, and have prototyped various ways to implement
> it. In the remainder of this post, I will describe the variant that I
> thought works best (so far).
>
> # Failure message format
>
> The basic idea of the new format is to let each node (not just the error
> source) commit to the failure message when it passes it back by adding an
> hmac. The sender verifies all hmacs upon receipt of the failure message.
> This makes it impossible for any of the nodes to modify the failure message
> without revealing that they might have played a part in the modification.
> It won’t be possible for the sender to pinpoint an exact node, because
> either end of a communication channel may have modified the message.
> Pinpointing a pair of nodes however is good enough, and is commonly done
> for regular onion failures too.
>
> On the highest level, the new failure message consists of three parts:
>
> `message` (var len) | `payloads` (fixed len) | `hmacs` (fixed len)
>
> * `message` is the standard onion failure message as described in [2], but
> without the hmac. The hmac is now part of `hmacs` and doesn't need to be
> repeated.
>
> * `payloads` is a fixed length array that contains space for each node
> (`hop_payload`) on the route to add data to return to the sender. Ideally
> the contents and size of `hop_payload` is signaled so that future
> extensions don’t require all nodes to upgrade. For now, we’ll assume the
> following 9-byte format:
>
> `is_final` (1 byte) | `duration` (8 bytes)
>
> `is_final` indicates whether this node is the failure source. The sender
> uses `is_final` to determine when to stop the decryption/verification
> process.
>
> `duration` is the time in milliseconds that the node held the htlc. By
> observing the series of reported durations, the sender is able to pinpoint
> a delay down to a pair of nodes.
>
> The `hop_payload` is repeated 27 times (the maximum route length).
>
> Every hop shifts `payloads` 9 bytes to the right and puts its own
> `hop_payload` in the 9 left-most bytes.
>
> * `hmacs` is a fixed length array where nodes add their hmacs as the
> failure message travels back to the sender.
>
> To keep things simple, I'll describe the format as if the maximum route
> length was only three hops (instead of 27):
>
> `hmac_0_2` | `hmac_0_1`| `hmac_0_0`| `hmac_1_1`| `hmac_1_0`| `hmac_2_0`
>
> Because nodes don't know their position in the path, it's unclear to
> them what part of the failure message they are supposed to include in the
> hmac. They can't just include everything, because if part of that data is
> deleted later (to keep the message size fixed) it opens up the possibility
> for nodes to blame others.
>
> The solution here is to provide hmacs for all possible positions. The
> last node that updated `hmacs` added `hmac_0_2`, `hmac_0_1` and `hmac_0_0`
> to the block. Each hmac corresponds to a presumed position in the path,
> where `hmac_0_2` is for the longest path (2 downstream hops) and `hmac_0_0`
> for the shortest (node is the error source).
>
> `hmac_x_y` is the hmac added by node x (counted from the node that is
> currently handling the failure message) assuming that this node is y hops
> away from the final node.
>
> Before an hop adds its hmacs, it first deletes some of the previous hmacs.
> This keeps the failure message at a fixed length. The removed hmacs are the
> ones that cannot be useful anymore. If node 0 adds itself, the former node
> 0 (now node 1) cannot be at the first position anymore. The former node 1
> (now node 2) cannot be at the second position anymore. The former node 2
> cannot be the source of the error anymore and isn’t represented in the
> failure message any longer. The corresponding hmacs (the now non-existent
> `hmac_0_2`, `hmac_1_1` and `hmac_2_0`) are deleted by node 0.
>
> Deleting the useless data reduces the number of hmacs (and roughly the
> total failure message size) to half.
>
> The delete operation transform the fields above to:
>
> <empty> | <empty> | <empty> | `hmac_0_1`| `hmac_0_0`| `hmac_1_0`
>
> The exact data that is included in each hmac is:
> * `message`
> * the node’s own `hop_payload` and a set of downstream `hop_payload`s,
> depending on assumed position
> * a set of downstream node hmacs, depending on assumed position
>
> For example `hmac_0_1` is based on:
>
> `message` | `hop_payload[0]` | `hop_payload[1]` | `hmac_1_0`
>
> If the node that is currently handling the failure message is one hop away
> from the final node, it needs to cover its own `hop_payload[0]`, the final
> node `hop_payload[1]` and the final node hmac `hmac_1_0`.
>
> A longer path is committed to in `hmac_0_2`:
>
> `message` | `hop_payload[0]` | `hop_payload[1]` | `hop_payload[2]` |
> `hmac_1_1` | `hmac_2_0`
>
> The current node is two hops away from the final node. It needs to cover
> its own `hop_payload[0]` as well as `hop_payload[1]` and `hop_payload[2]`
> for the next and final hops. Additionally it covers the next hop `hmac_1_1`
> and final hop `hmac_2_0`, which correspond to the positions of those nodes
> in the path that is assumed for `hmac_0_2`.
>
> With this information, the sender is able to verify the longest chain of
> hmacs until it encounters a `hop_payload` with `is_final` set.
>
> If any of the nodes messes with any byte in the failure message, the
> sender is always able to determine a pair of nodes that the offending node
> is part of. This statement can be verified through reasoning, but to be
> sure I also tested it with code. I’ve simulated a malicious node that
> modifies a byte of the failure message at index x and observed the error
> source as determined by the sender. For every x, the sender reports the
> same correct pair.
>
> # Size
>
> The obvious downside of the scheme above is the size. Given a maximum of
> 27 hops, the `hmacs` block contains 27+26+25+...+1=378 hmacs of 32 bytes
> each. This makes for a total size of 12 KB.
>
> It could be the case though that it is not possible to devise a more
> compact scheme that also preserves the existing privacy guarantees. I know
> that smart people have spent time on this problem, but nonetheless no
> better solution has come up in the past years. A proof of its non-existence
> would be interesting for sure.
>
> I personally think the size increase is justified to fix this
> vulnerability in Lightning. Also if failures are expected to become more
> rare going forward, size becomes less relevant to the overall operation of
> the network.
>
> Another option is to reduce the maximum number of hops. It is questionable
> whether 27 hops are really needed in practice, and such long routes also
> contribute to latency and capital lock up. If for example the new failure
> message could only be used with routes up to 10 hops, the total number of
> hmacs would drop from 378 to 55. This makes for a total message size of
> about 2 KB.
>
> # Signaling
>
> For backwards compatibility nodes need to know what algorithm they should
> run to generate or transform the failure message. This can be signaled by
> the sender via a tlv onion field. A failure message format signaling
> mechanism is also discussed in the context of long failure messages [3].
> The failure message described in this post could be just another version.
>
> Additionally, intermediate nodes need to advertise their capability to
> transform the new format through a feature bit.
>
> # Delayed successes
>
> It’s not just failures that can be delayed. Successes can too. In that
> case, there is no failure message to improve. It could be an option to add
> the same `payloads` and `hmacs` blocks to the `update_fulfill_htlc` message.
>
> [1]
> https://lists.linuxfoundation.org/pipermail/lightning-dev/2019-June/002015.html
> [2]
> https://github.com/lightning/bolts/blob/master/04-onion-routing.md#returning-errors
> [3] https://github.com/lightning/bolts/pull/1021
> [4]
> https://github.com/lightningnetwork/lnd/blob/4fbd608b734f348d7e79fbfc7feaecc5c6c33a90/routing/result_interpretation.go#L419
> [5]
>
> https://github.com/ACINQ/eclair/blob/a0433aa0c027c9be618c5afe18e7f91642a7f372/eclair-core/src/main/scala/fr/acinq/eclair/payment/PaymentEvents.scala#L221
> [6]
> https://github.com/ElementsProject/lightning/blob/62bfed9a8df8731be44ba4e86afb08a5d28a4442/plugins/libplugin-pay.c#L1461
> [7]
> https://github.com/lightningdevkit/rust-lightning/blob/e61f3a238a70cbac87209e223b7c396108a49b97/lightning-invoice/src/payment.rs#L682
>
> _______________________________________________
> Lightning-dev mailing list
> Lightning-dev at lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/lightning-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linuxfoundation.org/pipermail/lightning-dev/attachments/20221020/fd913402/attachment.html>