Subject | Re: [ib-support] select bug on Linux!! |
---|---|
Author | Brad Pepers |
Post date | 2001-05-03T07:15:41Z |
I've got to stop responding to myself about this!
Just an added note: an easy way to watch some of this happening is to connect
isql to a remote database (or to one via localhost), as I said, and then run
"netstat | head" as root on Linux. You can see the TCP ports of the isql and
gds_inet_server programs and watch their Send-Q and Recv-Q fields. If you
leave isql just sitting there, after 60 seconds the Send-Q of the server
suddenly jumps to 40812 and the Recv-Q of isql to 4008 (at least on my
system). I think each dummy request is 4 bytes, so this means 10203 (40812 / 4)
dummy requests have been sent to the isql program! If you run strace on the
running gds_inet_server you will see it's blocked trying to send even more
4-byte dummy requests.
Now if you try a command in isql and watch, it's the client's turn to get into
a loop! The server is blocked waiting to send more data to the client, and
the client has sent a request to the server but is not getting a result. The
client times out, gets into the same kind of loop, and floods the server
with its own 4-byte dummy requests until it too blocks on send!
Now both queues are filled with data. After a while it all sorts itself out,
though sometimes it ends up killing the client.
I just wanted to give anyone on a Linux system an easy way to see and
reproduce this bug! I think the problems are all in remote/inet.c, in
select_wait (the client-side bug) and packet_receive (the server-side bug).
Fixing this is as easy as adding a line of code to reset the timeout at the
top of the loop rather than depending on it retaining its value. The select
man pages on Unix systems often warned that the timeout might be modified on
return, but no other Unix system I know of ever actually changed it, so this
is one of the most common Linux porting bugs.
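A minimal sketch of the fix pattern, in C since remote/inet.c is C. The helper name wait_readable and its signature are hypothetical; this is not the actual select_wait or packet_receive code, it only shows the timeout (and the fd_set, which select also overwrites) being re-initialized on every pass through the loop.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper, not the real remote/inet.c code: wait up to
 * `seconds` for `fd` to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int seconds)
{
    /* FD_SET is undefined for descriptors outside [0, FD_SETSIZE). */
    assert(fd >= 0 && fd < FD_SETSIZE);

    for (;;) {
        /* Rebuild the fd_set at the top of every iteration:
         * select() overwrites it with the ready descriptors. */
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);

        /* Rebuild the timeout too: on Linux, select() writes the
         * time remaining back into it, so after a timeout it is 0. */
        struct timeval timeout;
        timeout.tv_sec = seconds;
        timeout.tv_usec = 0;

        int n = select(fd + 1, &readfds, NULL, NULL, &timeout);
        if (n > 0)
            return 1;           /* fd is readable */
        if (n == 0)
            return 0;           /* genuine timeout */
        if (errno != EINTR)
            return -1;          /* real error */
        /* Interrupted by a signal: retry with a fresh full timeout. */
    }
}
```

If the timeval were filled in only once before the loop, the zero left behind by the first timeout would make every later select return 0 immediately, which fits the flood of dummy requests described above. Restarting with the full timeout after EINTR is a simplification; a stricter version would track the time remaining. unistd.h is included only for the pipe()/write() calls a caller would use to try this out.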
--
Brad Pepers
brad@...