[Clusterusers] hex issues

Wed Dec 15 11:56:18 EST 2004

AHA! When I run my programs using just a few nodes, I don't get the 
problem I described earlier.
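
For anyone who wants to try the retry-and-wait workaround from the quoted messages below, here is a rough Python sketch. It is not anyone's actual script: the names run_with_retry and start_on_nodes, the runner and sleep parameters, and the "uptime" command are hypothetical and only for illustration. The thread itself supplies only the idea of checking the return status, the 5-second wait, and the n01-n23 node names.

```python
import subprocess
import time


def run_with_retry(node, command, attempts=5, wait_seconds=5.0,
                   runner=None, sleep=time.sleep):
    """Run `command` on `node`, retrying while the exit status is nonzero.

    Returns the number of attempts used; raises RuntimeError if all fail.
    `runner` and `sleep` are injectable so the loop can be tested offline.
    """
    if runner is None:
        # Default (assumption): call the local rsh client, use its exit status.
        def runner(n, c):
            return subprocess.call(["rsh", n, c])
    for attempt in range(1, attempts + 1):
        if runner(node, command) == 0:
            return attempt
        if attempt < attempts:
            sleep(wait_seconds)  # wait before the next try
    raise RuntimeError(f"rsh to {node} failed {attempts} times")


def start_on_nodes(nodes, command, **kwargs):
    """Try every node and collect failures instead of stopping at the first."""
    failed = []
    for node in nodes:
        try:
            run_with_retry(node, command, **kwargs)
        except RuntimeError:
            failed.append(node)
    return failed
```

To walk the nodes from n23 down, as Ryan suggests, one could call start_on_nodes([f"n{i:02d}" for i in range(23, 0, -1)], "uptime") and look at which node names come back in the returned list.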

Ryan Moore wrote:
> Well, this error seems to refer to a failure of the TCP/IP protocol 
> stack to make a new socket connection. It's possible that you never 
> saw this message before b/c with a smaller number of nodes, you are 
> opening fewer TCP sessions. Try running the script 
> backwards through the nodes (starting with n23 down). I predict that the 
> error will crop up between 1 and 8 somewhere. Or I'm completely wrong 
> and I'll find a new theory. I'll keep looking for a way to tweak the 
> TCP/IP stack to fix this. It could mean the need for a new kernel on the 
> master node.
> 
> node 2 should be back to perfect.
> 
> -ryan
> 
> Lee Spector wrote:
> 
>>
>> Thanks Jaime. That seems to be the same issue I'm experiencing, except 
>> that I never had the problem on fly OR on hex prior to the recent node 
>> additions. I haven't been doing enough runs on the new nodes to say if 
>> the problem cropped up immediately when they were installed or 
>> sometime thereafter. But I know that it never happened in thousands of 
>> rsh calls prior to the new nodes. And for me it's now happening only 
>> on the new nodes.
>>
>> One other thing is that the problem does, in my case, interrupt my 
>> script... either that or it always happens simultaneously on a run of 
>> nodes from n??-n23, which seems unlikely.
>>
>> Also -- for the ones that are failing to accept rsh they are sometimes 
>> doing it very persistently. I've had to resort to manually rshing to 
>> each such node (vanilla "rsh" works to get an interactive shell, but 
>> rsh with a command argument does not) to kill processes.
>>
>> This seems like something that we should fix rather than work around 
>> (though the loops are clever!). Ryan: could you do a little digging on 
>> that error message ("protocol failure in circuit setup") and see if 
>> there's a known fix?
>>
>> Thanks,
>>
>>  -Lee
>>
>>
>> On Dec 15, 2004, at 8:54 AM, Jaime Davila wrote:
>>
>>> I've noticed that rsh commands don't always go through. For me, at 
>>> least, this is nothing new. I've recently moved to checking the 
>>> return status of the command, and looping if it indicates a problem. 
>>> For some reason, the same command to machine X at time T fails, while 
>>> at time T+d it succeeds, even when d is very small (if I just loop as 
>>> fast as the calling program can, it fails 5-10 times, and then 
>>> succeeds). This leads me to believe there are issues, outside my 
>>> control, that cause the command to not go through. I've placed a wait 
>>> period of 5 seconds on those loops, and I've yet to see a command 
>>> fail twice in a row.
>>>
>>> One thing to note is that the commands don't halt the rest of the 
>>> scripts or programs, nor do the scripts crash. The rsh part of it 
>>> simply fails.
>>>
>>> BTW, I've seen this type of thing when explicitly running rsh 
>>> commands, or with other commands that use the same network 
>>> facilities, like remote procedure calls.
>>>
>>> I've also seen the server side of my programs crash (or at least 
>>> disappear from the list of running processes), but this has happened 
>>> to me on many machines other than hex. I don't know enough about 
>>> remote procedure call protocols to debug the issue and pinpoint the 
>>> problem (the code I use is automatically generated by rpcgen).
>>>
>>> Lee Spector wrote:
>>>
>>>> Fellow cluster users,
>>>> I've recently resumed doing some runs on hex and have been having 
>>>> some puzzling problems.
>>>> The most puzzling is that my script to start lisp runs on all of the 
>>>> nodes seems most of the time to get interrupted at some random 
>>>> point, meaning that it starts my processes on nodes n01-n??, where 
>>>> n?? is occasionally 23 but sometimes 19, 15, or something else. One 
>>>> possibility here is that the processes really are being started but 
>>>> that they abort before they can do anything -- I thought this might 
>>>> happen if the nodes were heavily loaded, but when I poked around 
>>>> with w and top I didn't see much happening on the failing nodes... 
>>>> And besides it seems unlikely that somebody's running short, intense 
>>>> processes on the highest-numbered nodes, always in sequence from n23 
>>>> down...
>>>> In a possibly related problem I've been getting the following 
>>>> message printed to my shell on the master node from time to time:
>>>>      protocol failure in circuit setup
>>>> I've googled this but only learned that it's an rsh error, which 
>>>> makes sense in relation to the previous problem.
>>>> Another problem is that n02 is unresponsive, and seemingly in a way 
>>>> that's somehow worse than a normal crashed node. Normally a crashed 
>>>> node causes a hiccup and an error message ("unreachable node" or 
>>>> something like that) from some of my scripts, but those scripts are 
>>>> now hanging instead. Incidentally, some of the scripts I use to do 
>>>> diagnostics (and kill my processes!) are hanging this way, making it 
>>>> hard to mess around much.
>>>> Any leads or diagnostic suggestions?
>>>> Thanks,
>>>>  -Lee
>>>> -- 
>>>> Lee Spector
>>>> Dean, Cognitive Science + Professor, Computer Science
>>>> Cognitive Science, Hampshire College
>>>> 893 West Street, Amherst, MA 01002-3359
>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>> _______________________________________________
>>>> Clusterusers mailing list
>>>> Clusterusers at lists.hampshire.edu
>>>> http://lists.hampshire.edu/mailman/listinfo/clusterusers
>>>
>>>
>>>
>>> -- 
>>>
>>> ******************************************************
>>> Jaime J. Davila
>>> Assistant Professor of Computer Science
>>> Hampshire College
>>> School of Cognitive Science
>>> jdavila at hampshire dot edu
>>> http://helios.hampshire.edu/jdavila
>>> *******************************************************

-- 

******************************************************
Jaime J. Davila
Assistant Professor of Computer Science
Hampshire College
School of Cognitive Science
jdavila at hampshire dot edu
http://helios.hampshire.edu/jdavila
*******************************************************