[Clusterusers] hex issues

Wed Dec 15 08:54:25 EST 2004

I've noticed that rsh commands don't always go through. For me, at 
least, this is nothing new. I've recently moved to checking the return 
status of the command and looping if it indicates a problem. For some 
reason, the same command to machine X fails at time T but succeeds at 
time T+d, even when d is very small (if I just loop as fast as the 
calling program can, it fails 5-10 times and then succeeds). This leads 
me to believe there are issues outside my control that cause the 
command to not go through. I've placed a wait period of 5 seconds on 
those loops, and I've yet to see a command fail twice in a row.
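
For anyone who wants to try the same workaround, here is a minimal 
sketch of that retry loop in Python. It is not my actual script: the 
function name run_with_retry, the max_attempts cap, and the injectable 
sleep argument are all made up for illustration. It only looks at the 
exit status of the local command, so it catches rsh itself failing to 
connect, which is the case described above.

```python
import subprocess
import sys
import time


def run_with_retry(argv, max_attempts=10, delay=5.0, sleep=time.sleep):
    """Run argv until it exits with status 0.

    Returns the number of attempts used. Raises RuntimeError if all
    max_attempts attempts fail. The names and defaults here are
    illustrative assumptions, not taken from any existing script.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(argv)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt} failed with status {result.returncode}",
            file=sys.stderr,
        )
        # Wait before the next try, but not after the final attempt.
        if attempt < max_attempts:
            sleep(delay)
    raise RuntimeError(f"{argv!r} failed {max_attempts} times in a row")


# Hypothetical usage (node name and remote command are placeholders):
#   run_with_retry(["rsh", "n01", "hostname"])
```

The cap on attempts is there so that a node that is really down 
produces an error instead of a loop that never ends.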

One thing to note is that the commands don't halt the rest of the 
scripts or programs, nor do the scripts crash. The rsh part of it simply 
fails.

BTW, I've seen this type of thing when explicitly running rsh commands, 
or with other commands that use the same network facilities, like remote 
procedure calls.

I've also seen the server side of my programs crash (or at least 
disappear from the list of running processes), but this has happened to 
me on many machines other than hex. I don't know enough about remote 
procedure call protocols to debug the issue and pinpoint the problem 
(the code I use is automatically generated by rpcgen).

Lee Spector wrote:
> 
> Fellow cluster users,
> 
> I've recently resumed doing some runs on hex and have been having some 
> puzzling problems.
> 
> The most puzzling is that my script to start lisp runs on all of the 
> nodes seems most of the time to get interrupted at some random point, 
> meaning that it starts my processes on nodes n01-n??, where n?? is 
> occasionally 23 but sometimes 19, 15, or something else. One possibility 
> here is that the processes really are being started but that they abort 
> before they can do anything -- I thought this might happen if the nodes 
> were heavily loaded, but when I poked around with w and top I didn't see 
> much happening on the failing nodes... And besides it seems unlikely 
> that somebody's running short, intense processes on the highest-numbered 
> nodes, always in sequence from n23 down...
> 
> In a possibly related problem I've been getting the following message 
> printed to my shell on the master node from time to time:
> 
>      protocol failure in circuit setup
> 
> I've googled this but only learned that it's an rsh error, which makes 
> sense in relation to the previous problem.
> 
> Another problem is that n02 is unresponsive, and seemingly in a way 
> that's somehow worse than a normal crashed node. Normally a crashed node 
> causes a hiccup and an error message ("unreachable node" or something 
> like that) from some of my scripts that are currently hanging instead. 
> Incidentally, some of the scripts I use to do diagnostics (and kill my 
> processes!) are hanging this way, making it hard to mess around much.
> 
> Any leads or diagnostic suggestions?
> 
> Thanks,
> 
>  -Lee
> 
> -- 
> Lee Spector
> Dean, Cognitive Science + Professor, Computer Science
> Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspector at hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438
> 
> 

-- 

******************************************************
Jaime J. Davila
Assistant Professor of Computer Science
Hampshire College
School of Cognitive Science
jdavila at hampshire dot edu
http://helios.hampshire.edu/jdavila
*******************************************************