TCP4RECVERR counter mis-counting on 9.16+?
Via [Support Ticket #17204](https://support.isc.org/Ticket/Display.html?id=17204), Dave F. tells us of a problem with the TCP4RECVERR counter that sounds as though we are probably mis-counting in releases that use the new netmgr.
We're attempting an upgrade from 9.14.12 to 9.16.[5,6,7]. Upon upgrade we started getting internal alarms firing because the BIND internal stat TCP4RecvErr increases on every DNS TCP query we send. The DNS queries get valid DNS responses, so if we hadn't been alarming on the stat we never would have noticed an immediate issue.
...
We saw this problem on 9.16.5 and then repro'd it on .6 and .7.
To help with the debug, I am including some stat blocks.
This is the initial state of the stats before we issue the queries:
{
"json-stats-version":"1.5",
"boot-time":"2020-10-08T19:44:32.194Z",
"config-time":"2020-10-08T20:47:37.818Z",
"current-time":"2020-10-08T20:55:05.270Z",
"version":"9.16.7",
"sockstats":{
"UDP4Open":2,
"TCP4Open":3,
"RawOpen":1,
"TCP4Close":243,
"TCP4Accept":244,
"TCP4RecvErr":88,
"UDP4Active":3,
"TCP4Active":4,
"RawActive":1
},
We then run two TCP queries using dig 9.16.7, built from the same BIND version that is under test.
bash-4.2$ dig @127.0.0.1 localhost a +tcp
; <<>> DiG 9.16.7 <<>> @127.0.0.1 localhost a +tcp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 4021
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: ba9ced48a62b6061010000005f7f7cc12056d6b3fad00a0c (good)
;; QUESTION SECTION:
;localhost. IN A
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Oct 08 20:55:29 UTC 2020
;; MSG SIZE rcvd: 66
bash-4.2$ dig @127.0.0.1 <redacted> TXT +tcp
; <<>> DiG 9.16.7 <<>> @127.0.0.1 <redacted> TXT +tcp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28547
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: bca3ab06023e0f15010000005f7f7cc7c9512f87cfdff056 (good)
;; QUESTION SECTION:
;<redacted>. IN TXT
;; ANSWER SECTION:
<redacted>. 3600 IN TXT "OK"
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Oct 08 20:55:35 UTC 2020
;; MSG SIZE rcvd: 101
Note that we have a REFUSED response and a NOERROR response, but both are valid DNS responses with no sign of a TCP error.
After running these queries the stat block looks like:
bash-4.2$ curl http://localhost:8053/json/v1/net
{
"json-stats-version":"1.5",
"boot-time":"2020-10-08T19:44:32.194Z",
"config-time":"2020-10-08T20:47:37.818Z",
"current-time":"2020-10-08T20:55:37.681Z",
"version":"9.16.7",
"sockstats":{
"UDP4Open":2,
"TCP4Open":3,
"RawOpen":1,
"TCP4Close":246,
"TCP4Accept":247,
"TCP4RecvErr":90,
"UDP4Active":3,
"TCP4Active":4,
"RawActive":1
},
Note that TCP4RecvErr has increased by 2, corresponding to the 2 queries. This is the concern.
TCP4Close and TCP4Accept both increase by 3 from the baseline due to the 2 queries plus the curl request to obtain the stats. This seems to be normal.
Please let me know if I can provide any more information.
We'd like to understand if this is a known issue and if there are any other concerns (memory leak, socket leak, performance issue) associated with the error.
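The before/after comparison described in the report can be reproduced mechanically. The following is a minimal sketch, not part of the original ticket: it hard-codes the two `sockstats` blocks quoted above and prints only the counters that changed. The helper name `sockstats_delta` is hypothetical, introduced here for illustration.

```python
import json


def sockstats_delta(before, after):
    """Return {counter: after - before} for every counter whose value changed.

    Hypothetical helper: counters missing from one snapshot are treated as 0.
    """
    b = before.get("sockstats", {})
    a = after.get("sockstats", {})
    return {
        name: a.get(name, 0) - b.get(name, 0)
        for name in sorted(set(a) | set(b))
        if a.get(name, 0) != b.get(name, 0)
    }


# Snapshot taken at 20:55:05, before the two dig +tcp queries (from the report).
before = {"sockstats": {
    "UDP4Open": 2, "TCP4Open": 3, "RawOpen": 1,
    "TCP4Close": 243, "TCP4Accept": 244, "TCP4RecvErr": 88,
    "UDP4Active": 3, "TCP4Active": 4, "RawActive": 1,
}}

# Snapshot taken at 20:55:37, after the two queries (from the report).
after = {"sockstats": {
    "UDP4Open": 2, "TCP4Open": 3, "RawOpen": 1,
    "TCP4Close": 246, "TCP4Accept": 247, "TCP4RecvErr": 90,
    "UDP4Active": 3, "TCP4Active": 4, "RawActive": 1,
}}

delta = sockstats_delta(before, after)
print(json.dumps(delta))
# prints {"TCP4Accept": 3, "TCP4Close": 3, "TCP4RecvErr": 2}
```

Against a live server, the same function could be applied to two parsed responses from the statistics channel URL used in the report (`http://localhost:8053/json/v1/net`), bearing in mind that each fetch is itself a TCP connection and is counted in TCP4Accept and TCP4Close.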