[Clusterusers] hex issues

Wed Dec 15 10:01:47 EST 2004

Thanks Jaime. That seems to be the same issue I'm experiencing, except 
that I never had the problem on fly OR on hex prior to the recent node 
additions. I haven't been doing enough runs on the new nodes to say if 
the problem cropped up immediately when they were installed or sometime 
thereafter. But I know that it never happened in thousands of rsh calls 
prior to the new nodes. And it's now happening, for me, only on the 
new nodes.

One other thing is that the problem does, in my case, interrupt my 
script... either that or it always happens simultaneously on a run of 
nodes from n??-n23, which seems unlikely.
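
One way to tell those two cases apart would be a sweep that records the
outcome for every node instead of stopping at the first failure. A
minimal Python sketch follows. The function names, the 20-second
timeout, and the idea of probing with a harmless command are all
assumptions for illustration, not taken from the actual launch script:

```python
import subprocess


def node_names(first=1, last=23):
    """Return node names in the n01..n23 style used on the cluster."""
    return ["n%02d" % i for i in range(first, last + 1)]


def sweep(nodes, command, timeout=20, run=subprocess.run):
    """Run `command` on each node via rsh and record what happened.

    Returns a dict mapping node -> (exit_status, stderr_text).
    A node that does not answer within `timeout` seconds (an assumed
    value) is recorded as (None, "timed out") so that one hung node
    cannot stall the whole sweep.
    """
    results = {}
    for node in nodes:
        try:
            proc = run(["rsh", node, command],
                       capture_output=True, text=True, timeout=timeout)
            results[node] = (proc.returncode, proc.stderr.strip())
        except subprocess.TimeoutExpired:
            results[node] = (None, "timed out")
    return results


def failed_nodes(results):
    """List the nodes whose rsh call timed out or exited nonzero."""
    return [node for node, (status, _) in results.items() if status != 0]
```

Something like failed_nodes(sweep(node_names(), "uptime")) would then
show whether the failures really do cluster at the high-numbered nodes
("uptime" is only a placeholder command).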

Also -- the nodes that are failing to accept rsh are sometimes failing 
very persistently. I've had to resort to manually rshing to 
each such node (vanilla "rsh" works to get an interactive shell, but 
rsh with a command argument does not) to kill processes.

This seems like something that we should fix rather than work around 
(though the loops are clever!). Ryan: could you do a little digging on 
that error message ("protocol failure in circuit setup") and see if 
there's a known fix?
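
For anyone who wants the workaround in the meantime, the loop Jaime
describes below (check the return status, wait 5 seconds, try again)
might look roughly like this in Python. This is a sketch, not Jaime's
code: run_with_retry and its parameters are hypothetical names, and the
cap of 5 attempts is an assumption.

```python
import subprocess
import time


def run_with_retry(argv, attempts=5, delay=5.0,
                   run=subprocess.call, sleep=time.sleep):
    """Run `argv` until it exits with status 0.

    Returns the number of attempts that were needed. Raises
    RuntimeError if every attempt fails. The 5-second delay follows
    the wait period mentioned in the thread; the attempt cap is an
    assumed safety limit so a dead node cannot loop forever.
    """
    last_status = None
    for attempt in range(1, attempts + 1):
        last_status = run(argv)  # exit status of the local rsh process
        if last_status == 0:
            return attempt
        if attempt < attempts:
            sleep(delay)  # pause before trying again
    raise RuntimeError(
        "command %r failed %d times (last status %r)"
        % (argv, attempts, last_status)
    )
```

A call such as run_with_retry(["rsh", "n05", "uptime"]) would return
the number of attempts it took; the node and command here are
placeholders.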

Thanks,

  -Lee

On Dec 15, 2004, at 8:54 AM, Jaime Davila wrote:

> I've noticed that rsh commands don't always go through. For me, at 
> least, this is nothing new. I've recently moved to checking the return 
> status of the command, and looping if it indicates a problem. For some 
> reason, the same command to machine X at time T fails, while at time 
> T+d it succeeds, even when d is very small (if I just loop as fast as 
> the calling program can, it fails 5-10 times, and then succeeds). This 
> leads me to believe there are issues, outside my control, that cause 
> the command to not go through. I've placed a wait period of 5 seconds 
> on those loops, and I've yet to see a command fail twice in a row.
>
> One thing to note is that the commands don't halt the rest of the 
> scripts or programs, nor do the scripts crash. The rsh part of it 
> simply fails.
>
> BTW, I've seen this type of thing when explicitly running rsh 
> commands, or with other commands that use the same network facilities, 
> like remote procedure calls.
>
> I've also seen the server side of my programs crash (or at least 
> disappear from the list of running processes), but this has happened 
> to me on many machines other than hex. I don't know enough about 
> remote procedure call protocols to debug the issue and pinpoint the 
> problem (the code I use is automatically generated by rpcgen).
>
> Lee Spector wrote:
>> Fellow cluster users,
>> I've recently resumed doing some runs on hex and have been having 
>> some puzzling problems.
>> The most puzzling is that my script to start lisp runs on all of the 
>> nodes seems most of the time to get interrupted at some random point, 
>> meaning that it starts my processes on nodes n01-n??, where n?? is 
>> occasionally 23 but sometimes 19, 15, or something else. One 
>> possibility here is that the processes really are being started but 
>> that they abort before they can do anything -- I thought this might 
>> happen if the nodes were heavily loaded, but when I poked around with 
>> w and top I didn't see much happening on the failing nodes... And 
>> besides it seems unlikely that somebody's running short, intense 
>> processes on the highest-numbered nodes, always in sequence from n23 
>> down...
>> In a possibly related problem I've been getting the following message 
>> printed to my shell on the master node from time to time:
>>      protocol failure in circuit setup
>> I've googled this but only learned that it's an rsh error, which 
>> makes sense in relation to the previous problem.
>> Another problem is that n02 is unresponsive, and seemingly in a way 
>> that's somehow worse than a normal crashed node. Normally a crashed 
>> node causes a hiccup and an error message ("unreachable node" or 
>> something like that) from some of my scripts, but those scripts are 
>> currently hanging instead. Incidentally, some of the scripts I use to do 
>> diagnostics (and kill my processes!) are hanging this way, making it 
>> hard to mess around much.
>> Any leads or diagnostic suggestions?
>> Thanks,
>>  -Lee
>> -- 
>> Lee Spector
>> Dean, Cognitive Science + Professor, Computer Science
>> Cognitive Science, Hampshire College
>> 893 West Street, Amherst, MA 01002-3359
>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>> Phone: 413-559-5352, Fax: 413-559-5438
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> http://lists.hampshire.edu/mailman/listinfo/clusterusers
>
> -- 
>
> ******************************************************
> Jaime J. Davila
> Assistant Professor of Computer Science
> Hampshire College
> School of Cognitive Science
> jdavila at hampshire dot edu
> http://helios.hampshire.edu/jdavila
> *******************************************************
>
--
Lee Spector
Dean, Cognitive Science + Professor, Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438