Skip to content

TLS DNS: take into account partial writes by SSL_write_ex()

Michal Nowak requested to merge 4255-suspicious-messsage-id-fix into bind-9.18

This commit changes TLS DNS so that partial writes by the SSL_write_ex() function are taken into account properly. Now, before doing encryption, we are flushing the buffers for outgoing encrypted data.

The problem is fairly complicated and originates from the fact that it is somewhat hard to understand by reading the documentation if and when partial writes are supported/enabled or not, and one can get a false impression that they are not supported or enabled by default (https://www.openssl.org/docs/man3.1/man3/SSL_write_ex.html). I have added a lengthy comment about that into the code because it will be more useful there. The documentation on this topic is vague and hard to follow.

The main point is that when SSL_write_ex() fails with SSL_ERROR_WANT_WRITE, the OpenSSL code tells us that we need to flush the outgoing buffers and then call SSL_write_ex() again with exactly the same arguments in order to continue as partial write could have happened on the previous call to SSL_write_ex() (that is not hard to verify by calling BIO_pending(sock->tls.app_rbio) before and after the call to SSL_write_ex() and comparing the returned values). This aspect was not taken into account in the code.

Now, one can wonder how that could have led to the behaviour that we saw in the #4255 (closed) bug report. In particular, how could we lose one message and duplicate one twice? That is where things get interesting.

One needs to keep two things in mind (that is important):

Firstly, the possibility that two (or more) subsequent SSL_write_ex() calls will be done with exactly the same arguments is very high (the code does not guarantee that in any way, but in practice, that happens a lot).

Secondly, the dnsperf (the software that helped us to trigger the bug) bombed the test server with messages that contained exactly the same data. The only difference in the responses is message IDs, which can be found closer to the start of a message.

So, that is what was going on in the older version of the code:

  1. During one of the isc_nm_send() calls, the SSL_write_ex() call fails with SSL_ERROR_WANT_WRITE. Partial writing has happened, though, and we wrote a part of the message with the message ID (e.g. 2014). Nevertheless, we have rescheduled the complete send operation asynchronously by a call to tlsdns_send_enqueue().

  2. While the asynchronous request has not been completed, we try to send the message (e.g. with ID 2015). The next isc_nm_send() or re-queued send happens with a call to SSL_write_ex() with EXACTLY the same arguments as in the case of the previous call. That is, we are acting as if we want to complete the previously failed SSL_write_ex() attempt (according to the OpenSSL documentation: https://www.openssl.org/docs/man3.1/man3/SSL_write_ex.html, the "Warnings" section). This way, we already have a start of the message containing the previous ID (2014 in our case) but complete the write request with the rest of the data given in the current write attempt. However, as responses differ only in message ID, we end up sending a valid (properly structured) DNS message but with the ID of the previous one. This way, we send a message with ID from the previous isc_nm_send() attempt. The message with the ID from the send request from this attempt will never be sent, as the code thinks that it is sending it now (that is how we send the message with ID 2014 instead of 2015, as in our example, thus making the message with ID 2015 never to be sent).

  3. At some point later, the asynchronous send request (the rescheduled on the first step) completes without an error, sending a second message with the same ID (2014).

It took exhausting SSL write buffers (so that a data encryption attempt cannot be completed in one operation) via long DoT streams in order to exhibit the behaviour described above. The exhaustion happened because we have not been trying to flush the buffers often enough (especially in the case of multiple subsequent writes).

In my opinion, the origin of the problem can be described as follows:

It happened due to making wrong guesses caused by poorly written documentation.

Closes #4255 (closed)


This has been originally developed and reviewed in isc-private/bind9!572.

Merge request reports