[Clusterusers] hex issues
Ryan Moore
ryan at hampshire.edu
Wed Dec 15 11:14:45 EST 2004
Well, this error seems to point to the TCP/IP protocol stack failing
to set up a new socket connection. It's possible that you never saw
this message before because, with a smaller number of nodes, you were
opening fewer TCP sessions. Try running the script backwards through
the nodes (starting at n23 and working down); I predict the error
will crop up somewhere between n01 and n08. Or I'm completely wrong
and I'll find a new theory. I'll keep looking for a way to tweak the
TCP/IP stack to fix this; it could mean we need a new kernel on the
master node.
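
If anyone wants to check this theory, something like the following
run on the master while the script is going should show whether we're
piling up connections (a rough sketch; netstat's exact column layout
varies a little between systems):

  #!/bin/sh
  # Sockets lingering in TIME_WAIT -- each recent rsh leaves one
  # behind for a minute or two.
  netstat -tn | grep -c TIME_WAIT

  # rsh also needs free privileged source ports (512-1023) -- in fact
  # two per session, one for the command and one for the stderr
  # "circuit" that error message refers to -- so count how many are
  # currently tied up.
  netstat -tn | awk '{ split($4, a, ":");
                       if (a[2] != "" && a[2] < 1024) n++ }
                     END { print n + 0 }'

If the second number is anywhere near 512 when the error fires,
that's our smoking gun.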
Node n02 should be back in working order.
-ryan
Lee Spector wrote:
>
> Thanks Jaime. That seems to be the same issue I'm experiencing,
> except that I never had the problem on fly OR on hex prior to the
> recent node additions. I haven't been doing enough runs on the new
> nodes to say whether the problem cropped up immediately when they
> were installed or sometime thereafter. But I know that it never
> happened in thousands of rsh calls prior to the new nodes. And for
> me it's now happening only on the new nodes.
>
> One other difference is that in my case the problem does interrupt
> my script... either that, or it always happens simultaneously on a
> whole run of nodes from n??-n23, which seems unlikely.
>
> Also -- the nodes that fail to accept rsh sometimes do so very
> persistently. I've had to resort to manually rshing to each such
> node (a vanilla "rsh" to get an interactive shell works, but rsh
> with a command argument does not) to kill processes.
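>
> (Normally a loop like the following would clean these up, but rsh
> with a command argument is exactly what's failing on those nodes,
> so I end up logging in to each one by hand. "lisp" here is just a
> stand-in for the real process name:)
>
>   #!/bin/sh
>   # Kill my leftover jobs on every node.  rsh -n keeps rsh from
>   # swallowing the loop's stdin.
>   for i in `seq -w 1 23`; do
>       rsh -n n$i "pkill -u $USER lisp"
>   done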
>
> This seems like something that we should fix rather than work around
> (though the loops are clever!). Ryan: could you do a little digging on
> that error message ("protocol failure in circuit setup") and see if
> there's a known fix?
>
> Thanks,
>
> -Lee
>
>
> On Dec 15, 2004, at 8:54 AM, Jaime Davila wrote:
>
>> I've noticed that rsh commands don't always go through. For me, at
>> least, this is nothing new. I've recently moved to checking the
>> return status of the command and looping if it indicates a problem.
>> For some reason, the same command to machine X fails at time T but
>> succeeds at time T+d, even when d is very small (if I just loop as
>> fast as the calling program can, it fails 5-10 times and then
>> succeeds). This leads me to believe there are issues, outside my
>> control, that cause the command to not go through. I've put a wait
>> period of 5 seconds in those loops, and I have yet to see a command
>> fail twice in a row.
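>>
>> Roughly what my retry looks like, sketched in shell (the node and
>> remote command are placeholders; note that rsh's exit status tells
>> you whether rsh itself got through, not whether the remote command
>> succeeded):
>>
>>   #!/bin/sh
>>   # Keep retrying until the rsh connection goes through, waiting
>>   # 5 seconds between attempts.
>>   until rsh n05 "uptime"; do
>>       echo "rsh failed, retrying in 5 seconds..." >&2
>>       sleep 5
>>   done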
>>
>> One thing to note is that these failures don't halt the rest of
>> the scripts or programs, nor do the scripts crash. The rsh part of
>> it simply fails.
>>
>> BTW, I've seen this type of thing when explicitly running rsh
>> commands, or with other commands that use the same network
>> facilities, like remote procedure calls.
>>
>> I've also seen the server side of my programs crash (or at least
>> disappear from the list of running processes), but this has
>> happened to me on many machines other than hex. I don't know
>> enough about the remote procedure call protocols to debug the
>> issue and pinpoint the problem (the code I use is automatically
>> generated by rpcgen).
>>
>> Lee Spector wrote:
>>
>>> Fellow cluster users,
>>> I've recently resumed doing some runs on hex and have been having
>>> some puzzling problems.
>>> The most puzzling is that my script to start lisp runs on all of the
>>> nodes seems most of the time to get interrupted at some random
>>> point, meaning that it starts my processes on nodes n01-n??, where
>>> n?? is occasionally 23 but sometimes 19, 15, or something else. One
>>> possibility here is that the processes really are being started but
>>> that they abort before they can do anything -- I thought this might
>>> happen if the nodes were heavily loaded, but when I poked around
>>> with w and top I didn't see much happening on the failing nodes...
>>> And besides it seems unlikely that somebody's running short, intense
>>> processes on the highest-numbered nodes, always in sequence from n23
>>> down...
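>>> (For concreteness, the launcher is essentially a loop like this,
>>> with a placeholder for the real lisp command line:)
>>>
>>>   #!/bin/sh
>>>   # Start a run on each node in sequence.  Redirecting output and
>>>   # backgrounding the remote command lets rsh return as soon as
>>>   # the job is launched instead of waiting for it to finish.
>>>   for i in `seq -w 1 23`; do
>>>       echo "starting n$i"
>>>       rsh -n n$i "lisp -load run.lisp > /dev/null 2>&1 &"
>>>   done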
>>> In a possibly related problem I've been getting the following
>>> message printed to my shell on the master node from time to time:
>>> protocol failure in circuit setup
>>> I've googled this but only learned that it's an rsh error, which
>>> makes sense in relation to the previous problem.
>>> Another problem is that n02 is unresponsive, and seemingly in a
>>> way that's somehow worse than a normal crashed node. Normally a
>>> crashed node just causes a hiccup and an error message
>>> ("unreachable node" or something like that) in my scripts, but
>>> this time some of them are hanging instead. Unfortunately, some
>>> of the scripts that are hanging this way are the ones I use to do
>>> diagnostics (and kill my processes!), making it hard to mess
>>> around much.
>>> Any leads or diagnostic suggestions?
>>> Thanks,
>>> -Lee
>>
>>
>> --
>>
>> ******************************************************
>> Jaime J. Davila
>> Assistant Professor of Computer Science
>> Hampshire College
>> School of Cognitive Science
>> jdavila at hampshire dot edu
>> http://helios.hampshire.edu/jdavila
>> *******************************************************
>>
>>
> --
> Lee Spector
> Dean, Cognitive Science + Professor, Computer Science
> Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspector at hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438
>
>