[Clusterusers] hex issues
Ryan Moore
ryan at hampshire.edu
Wed Dec 15 11:14:45 EST 2004
Well, this error seems to point to the TCP/IP protocol stack failing
to set up a new socket connection. It's possible that you never saw
this message before because, with a smaller number of nodes, you were
opening fewer TCP sessions. Try running the script backwards through
the nodes (starting at n23 and working down); I predict the error
will crop up somewhere between n01 and n08. Or I'm completely wrong
and I'll find a new theory. I'll keep looking for a way to tweak the
TCP/IP stack to fix this; it could mean we need a new kernel on the
master node.
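
If anyone wants to check this theory, something like the following
run on the master while the script is going should show whether we're
piling up connections (a rough sketch; netstat's exact column layout
varies a little between systems):

  #!/bin/sh
  # Sockets lingering in TIME_WAIT -- each recent rsh leaves one
  # behind for a minute or two.
  netstat -tn | grep -c TIME_WAIT

  # rsh also needs free privileged source ports (512-1023) -- in fact
  # two per session, one for the command and one for the stderr
  # "circuit" that error message refers to -- so count how many are
  # currently tied up.
  netstat -tn | awk '{ split($4, a, ":");
                       if (a[2] != "" && a[2] < 1024) n++ }
                     END { print n + 0 }'

If the second number is anywhere near 512 when the error fires,
that's our smoking gun.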
Node n02 should be back in working order.
-ryan
Lee Spector wrote:
>
> Thanks Jaime. That seems to be the same issue I'm experiencing,
> except that I never had the problem on fly OR on hex prior to the
> recent node additions. I haven't been doing enough runs on the new
> nodes to say whether the problem cropped up immediately when they
> were installed or sometime thereafter. But I know that it never
> happened in thousands of rsh calls prior to the new nodes. And for
> me it's now happening only on the new nodes.
>
> One other difference is that in my case the problem does interrupt
> my script... either that, or it always happens simultaneously on a
> whole run of nodes from n??-n23, which seems unlikely.
>
> Also -- the nodes that fail to accept rsh sometimes do so very
> persistently. I've had to resort to manually rshing to each such
> node (a vanilla "rsh" to get an interactive shell works, but rsh
> with a command argument does not) to kill processes.
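>
> (Normally a loop like the following would clean these up, but rsh
> with a command argument is exactly what's failing on those nodes,
> so I end up logging in to each one by hand. "lisp" here is just a
> stand-in for the real process name:)
>
>   #!/bin/sh
>   # Kill my leftover jobs on every node.  rsh -n keeps rsh from
>   # swallowing the loop's stdin.
>   for i in `seq -w 1 23`; do
>       rsh -n n$i "pkill -u $USER lisp"
>   done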
>
> This seems like something that we should fix rather than work around
> (though the loops are clever!). Ryan: could you do a little digging on
> that error message ("protocol failure in circuit setup") and see if
> there's a known fix?
>
> Thanks,
>
> -Lee
>
>
> On Dec 15, 2004, at 8:54 AM, Jaime Davila wrote:
>
>> I've noticed that rsh commands don't always go through. For me, at
>> least, this is nothing new. I've recently moved to checking the
>> return status of the command and looping if it indicates a problem.
>> For some reason, the same command to machine X fails at time T but
>> succeeds at time T+d, even when d is very small (if I just loop as
>> fast as the calling program can, it fails 5-10 times and then
>> succeeds). This leads me to believe there are issues, outside my
>> control, that cause the command to not go through. I've put a wait
>> period of 5 seconds in those loops, and I have yet to see a command
>> fail twice in a row.
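>>
>> Roughly what my retry looks like, sketched in shell (the node and
>> remote command are placeholders; note that rsh's exit status tells
>> you whether rsh itself got through, not whether the remote command
>> succeeded):
>>
>>   #!/bin/sh
>>   # Keep retrying until the rsh connection goes through, waiting
>>   # 5 seconds between attempts.
>>   until rsh n05 "uptime"; do
>>       echo "rsh failed, retrying in 5 seconds..." >&2
>>       sleep 5
>>   done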
>>
>> One thing to note is that these failures don't halt the rest of
>> the scripts or programs, nor do the scripts crash. The rsh part of
>> it simply fails.
>>
>> BTW, I've seen this type of thing when explicitly running rsh
>> commands, or with other commands that use the same network
>> facilities, like remote procedure calls.
>>
>> I've also seen the server side of my programs crash (or at least
>> disappear from the list of running processes), but this has
>> happened to me on many machines other than hex. I don't know
>> enough about the remote procedure call protocols to debug the
>> issue and pinpoint the problem (the code I use is automatically
>> generated by rpcgen).
>>
>> Lee Spector wrote:
>>
>>> Fellow cluster users,
>>> I've recently resumed doing some runs on hex and have been having
>>> some puzzling problems.
>>> The most puzzling is that my script to start lisp runs on all of the
>>> nodes seems most of the time to get interrupted at some random
>>> point, meaning that it starts my processes on nodes n01-n??, where
>>> n?? is occasionally 23 but sometimes 19, 15, or something else. One
>>> possibility here is that the processes really are being started but
>>> that they abort before they can do anything -- I thought this might
>>> happen if the nodes were heavily loaded, but when I poked around
>>> with w and top I didn't see much happening on the failing nodes...
>>> And besides it seems unlikely that somebody's running short, intense
>>> processes on the highest-numbered nodes, always in sequence from n23
>>> down...
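>>> (For concreteness, the launcher is essentially a loop like this,
>>> with a placeholder for the real lisp command line:)
>>>
>>>   #!/bin/sh
>>>   # Start a run on each node in sequence.  Redirecting output and
>>>   # backgrounding the remote command lets rsh return as soon as
>>>   # the job is launched instead of waiting for it to finish.
>>>   for i in `seq -w 1 23`; do
>>>       echo "starting n$i"
>>>       rsh -n n$i "lisp -load run.lisp > /dev/null 2>&1 &"
>>>   done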
>>> In a possibly related problem I've been getting the following
>>> message printed to my shell on the master node from time to time:
>>> protocol failure in circuit setup
>>> I've googled this but only learned that it's an rsh error, which
>>> makes sense in relation to the previous problem.
>>> Another problem is that n02 is unresponsive, and seemingly in a
>>> way that's somehow worse than a normal crashed node. Normally a
>>> crashed node just causes a hiccup and an error message
>>> ("unreachable node" or something like that) in my scripts, but
>>> this time some of them are hanging instead. Unfortunately, some
>>> of the scripts that are hanging this way are the ones I use to do
>>> diagnostics (and kill my processes!), making it hard to mess
>>> around much.
>>> Any leads or diagnostic suggestions?
>>> Thanks,
>>> -Lee
>>
>>
>> --
>>
>> ******************************************************
>> Jaime J. Davila
>> Assistant Professor of Computer Science
>> Hampshire College
>> School of Cognitive Science
>> jdavila at hampshire dot edu
>> http://helios.hampshire.edu/jdavila
>> *******************************************************
>>
>>
> --
> Lee Spector
> Dean, Cognitive Science + Professor, Computer Science
> Cognitive Science, Hampshire College
> 893 West Street, Amherst, MA 01002-3359
> lspector at hampshire.edu, http://hampshire.edu/lspector/
> Phone: 413-559-5352, Fax: 413-559-5438
>
>