Wednesday, June 30, 2004

What Hangs pbs_server?

Several of the problems I have found with OpenPBS are connection related. OpenPBS tends to connect to the nodes sequentially, and the c3 package (which comes with OSCAR) does the same. If one connection hangs for too long, the whole procedure halts. If the connections are slow enough that together they exceed the overall timeout limit, the whole procedure aborts. On the other hand, if all the connections were made at the same time, they would eat up the LAN bandwidth. Maybe the way out is to make the whole scheme event-driven, i.e., pbs_server mostly listens and pbs_mom does the talking.
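One middle ground between strictly sequential contact and contacting every node at once is a bounded number of parallel probes, each with its own timeout. This is not how pbs_server or c3 actually work; it is only a minimal sketch of the idea, assuming GNU xargs (for -P) and the coreutils timeout command are available. The names probe_nodes, PROBE_CMD, MAX_PARALLEL and PROBE_TIMEOUT are made up for this example.

```shell
#!/bin/sh
# Probe nodes with at most MAX_PARALLEL checks in flight, each limited to
# PROBE_TIMEOUT seconds, so one hung node cannot stall the others.
# PROBE_CMD, MAX_PARALLEL and PROBE_TIMEOUT are hypothetical knobs for this
# sketch, not PBS or c3 settings.
probe_nodes() {
    printf '%s\n' "$@" |
        xargs -P "${MAX_PARALLEL:-4}" -I{} sh -c '
            # $1 = timeout in seconds, $2 = probe command, $3 = node name.
            # $2 is left unquoted on purpose so "ping -c 1" splits into words.
            if timeout "$1" $2 "$3" >/dev/null 2>&1; then
                echo "$3 ok"
            else
                echo "$3 FAILED"
            fi
        ' sh "${PROBE_TIMEOUT:-5}" "${PROBE_CMD:-ping -c 1}" {}
}

# Usage (node names are examples only):
#   probe_nodes clic4l42 clic4l43 clic4l44
```

The results come back in completion order rather than input order, so pipe them through sort if that matters.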

So far, if PBS refuses to update the node status and returns a fake one because of the timeout issue, the only thing you can do is mark the down nodes offline. Yes, manually.
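Marking a node offline is done with pbsnodes -o. Here is a small sketch for doing several nodes in one go. The function name mark_offline and the PBSNODES variable are my own additions (the variable just lets you swap in another command for a dry run), and the node names are placeholders.

```shell
#!/bin/sh
# Mark each named node offline so that no new jobs are scheduled onto it.
# PBSNODES is a hypothetical override for this sketch (not a PBS setting);
# it defaults to the real pbsnodes command.
mark_offline() {
    for node in "$@"; do
        "${PBSNODES:-pbsnodes}" -o "$node" ||
            echo "failed to mark $node offline" >&2
    done
}

# Usage (node names are examples only):
#   mark_offline clic4l43 clic4l44
# The offline flag can be cleared later with: pbsnodes -c <node>
```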

I found an interesting post about how to check which node is bad when qstat doesn't work. I should also point out that we still don't have a nice way to remove the troubled sockets at this time.

From: Karsten Petersen
Subject: [TORQUEUsers] pbs_server hangs when a pbs_mom is down.

On Tue, 14 Oct 2003, Don Brace wrote:
> It is sometimes difficult to determine which node is causing the
> problem. Is there an automated way to determine which node is causing
> the problem?

We see this problem with OpenPBS 2.3.15 about once a month.

You should be able to identify the node by looking at the open
sockets of the pbs_server process.

With Linux:
    lsof -p `pgrep pbs_server` | grep IPv4
    
If everything is running well, it looks like this:
    pbs_serve 10832 root 6u IPv4 937764267 TCP *:pbs (LISTEN)
    pbs_serve 10832 root 7u IPv4 937764286 UDP *:15001 
    pbs_serve 10832 root 8u IPv4 937764287 UDP *:1023 

But if pbs_server hangs (no qstat output), you see several connections
to the dead node that are in the ESTABLISHED state:
    [...]
    pbs_serve 1780 [...] TCP clic0a1:1023->clic4l43:pbs_mom (ESTABLISHED)
    pbs_serve 1780 [...] TCP clic0a1:1022->clic4l43:pbs_mom (ESTABLISHED)
    pbs_serve 1780 [...] TCP clic0a1:1021->clic4l43:pbs_mom (ESTABLISHED)
    pbs_serve 1780 [...] TCP clic0a1:1020->clic4l43:pbs_mom (ESTABLISHED)
    [...]
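
With many nodes, it can help to tally the ESTABLISHED connections per remote host instead of reading the lsof output by eye. A minimal sketch, assuming the connection is printed as local:port->remote:port with the state in parentheses as the last field, as in the output above. The function name count_established is made up for this example.

```shell
#!/bin/sh
# Tally ESTABLISHED TCP connections per remote host from lsof output on stdin.
# Assumes each connection appears as "local:port->remote:port" followed by
# "(ESTABLISHED)" as the last field of the line.
count_established() {
    awk '
        $NF == "(ESTABLISHED)" {
            # The field before the state holds "local:port->remote:port".
            n = split($(NF - 1), ends, "->")
            if (n != 2) next
            split(ends[2], remote, ":")
            count[remote[1]]++
        }
        END {
            for (host in count) print count[host], host
        }
    ' | sort -rn
}

# Usage (same lsof invocation as in the email above):
#   lsof -p `pgrep pbs_server` | grep IPv4 | count_established
```

The host with the most stuck connections ends up on the first line, which should be the node to look at first.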
