[Clusterusers] hex issues
Jaime Davila
jdavila at hampshire.edu
Wed Dec 15 11:56:18 EST 2004
AHA! When I run my programs using just a few nodes, I don't get the
problem I described earlier.
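
For reference, the retry approach described in the quoted thread below (check the return status of the rsh command, wait a few seconds, try again) can be sketched roughly as follows. This is only an illustration, not the actual scripts in use on hex: the function names, the `max_tries` limit, and the injectable `runner` and `sleep` parameters are assumptions made for the example. The 5-second wait and the n01-n23 node names come from the thread.

```python
import subprocess
import time


def run_with_retry(cmd, max_tries=10, wait=5.0,
                   runner=subprocess.call, sleep=time.sleep):
    """Run cmd until it exits with status 0.

    Returns the number of attempts it took, or None if every attempt
    failed. runner and sleep are parameters (an assumption of this
    sketch) so the loop can be exercised without a real cluster.
    """
    for attempt in range(1, max_tries + 1):
        # With rsh, a non-zero status here generally means rsh itself
        # failed (e.g. the connection was not set up), which is the
        # case the thread is about.
        if runner(cmd) == 0:
            return attempt
        if attempt < max_tries:
            sleep(wait)  # pause before trying again
    return None


def rsh_command(node, remote_cmd):
    """Build the argument list for running remote_cmd on node via rsh."""
    return ["rsh", node, remote_cmd]


def node_names(first=1, last=23, reverse=False):
    """Return node names n01..n23, optionally from n23 down."""
    names = ["n%02d" % i for i in range(first, last + 1)]
    return names[::-1] if reverse else names
```

To try Ryan's suggestion of going through the nodes backwards, one could loop over `node_names(reverse=True)` and call `run_with_retry(rsh_command(node, "uptime"))` for each node, noting which nodes need more than one attempt ("uptime" is just a placeholder command).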
Ryan Moore wrote:
> Well, this error seems to refer to a failure by the TCP/IP protocol
> stack to make a new socket connection. It's possible that you never
> saw this message before b/c with a smaller number of nodes, you are
> opening fewer TCP sessions. Try running the script
> backwards through the nodes (starting with n23 down). I predict that the
> error will crop up between 1 and 8 somewhere. Or I'm completely wrong
> and I'll find a new theory. I'll keep looking for a way to tweak the
> TCP/IP stack to fix this. It could mean the need for a new kernel on the
> master node.
>
> node 2 should be back to perfect.
>
> -ryan
>
> Lee Spector wrote:
>
>>
>> Thanks Jaime. That seems to be the same issue I'm experiencing, except
>> that I never had the problem on fly OR on hex prior to the recent node
>> additions. I haven't been doing enough runs on the new nodes to say if
>> the problem cropped up immediately when they were installed or
>> sometime thereafter. But I know that it never happened in thousands of
>> rsh calls prior to the new nodes. And for me it's now happening only
>> on the new nodes.
>>
>> One other thing is that the problem does, in my case, interrupt my
>> script... either that or it always happens simultaneously on a run of
>> nodes from n??-n23, which seems unlikely.
>>
>> Also -- for the ones that are failing to accept rsh they are sometimes
>> doing it very persistently. I've had to resort to manually rshing to
>> each such node (vanilla "rsh" works to get an interactive shell, but
>> rsh with a command argument does not) to kill processes.
>>
>> This seems like something that we should fix rather than work around
>> (though the loops are clever!). Ryan: could you do a little digging on
>> that error message ("protocol failure in circuit setup") and see if
>> there's a known fix?
>>
>> Thanks,
>>
>> -Lee
>>
>>
>> On Dec 15, 2004, at 8:54 AM, Jaime Davila wrote:
>>
>>> I've noticed that rsh commands don't always go through. For me, at
>>> least, this is nothing new. I've recently moved to checking the
>>> return status of the command, and looping if it indicates a problem.
>>> For some reason, the same command to machine X at time T fails, while
>>> at time T+d it succeeds, even when d is very small (if I just loop as
>>> fast as the calling program can, it fails 5-10 times, and then
>>> succeeds). This leads me to believe there are issues, outside my
>>> control, that cause the command to not go through. I've placed a wait
>>> period of 5 seconds on those loops, and I've yet to see a command
>>> fail twice in a row.
>>>
>>> One thing to note is that the commands don't halt the rest of the
>>> scripts or programs, nor do the scripts crash. The rsh part of it
>>> simply fails.
>>>
>>> BTW, I've seen this type of thing when explicitly running rsh
>>> commands, or with other commands that use the same network
>>> facilities, like remote procedure calls.
>>>
>>> I've also seen the server side of my programs crash (or at least
>>> disappear from the list of running processes), but this has happened
>>> to me on many machines other than hex. I don't know enough about
>>> remote procedure call protocols to debug the issue and pinpoint the
>>> problem (the code I use is automatically generated by rpcgen).
>>>
>>> Lee Spector wrote:
>>>
>>>> Fellow cluster users,
>>>> I've recently resumed doing some runs on hex and have been having
>>>> some puzzling problems.
>>>> The most puzzling is that my script to start lisp runs on all of the
>>>> nodes seems most of the time to get interrupted at some random
>>>> point, meaning that it starts my processes on nodes n01-n??, where
>>>> n?? is occasionally 23 but sometimes 19, 15, or something else. One
>>>> possibility here is that the processes really are being started but
>>>> that they abort before they can do anything -- I thought this might
>>>> happen if the nodes were heavily loaded, but when I poked around
>>>> with w and top I didn't see much happening on the failing nodes...
>>>> And besides it seems unlikely that somebody's running short, intense
>>>> processes on the highest-numbered nodes, always in sequence from n23
>>>> down...
>>>> In a possibly related problem I've been getting the following
>>>> message printed to my shell on the master node from time to time:
>>>> protocol failure in circuit setup
>>>> I've googled this but only learned that it's an rsh error, which
>>>> makes sense in relation to the previous problem.
>>>> Another problem is that n02 is unresponsive, and seemingly in a way
>>>> that's somehow worse than a normal crashed node. Normally a crashed
>>>> node causes a hiccup and an error message ("unreachable node" or
>>>> something like that) from my scripts, but some of them are currently
>>>> hanging instead. Incidentally, some of the scripts I use to do
>>>> diagnostics (and kill my processes!) are hanging this way, making it
>>>> hard to mess around much.
>>>> Any leads or diagnostic suggestions?
>>>> Thanks,
>>>> -Lee
>>>> --
>>>> Lee Spector
>>>> Dean, Cognitive Science + Professor, Computer Science
>>>> Cognitive Science, Hampshire College
>>>> 893 West Street, Amherst, MA 01002-3359
>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>> _______________________________________________
>>>> Clusterusers mailing list
>>>> Clusterusers at lists.hampshire.edu
>>>> http://lists.hampshire.edu/mailman/listinfo/clusterusers
>>>
>>>
>>>
>>> --
>>>
>>> ******************************************************
>>> Jaime J. Davila
>>> Assistant Professor of Computer Science
>>> Hampshire College
>>> School of Cognitive Science
>>> jdavila at hampshire dot edu
>>> http://helios.hampshire.edu/jdavila
>>> *******************************************************
--
******************************************************
Jaime J. Davila
Assistant Professor of Computer Science
Hampshire College
School of Cognitive Science
jdavila at hampshire dot edu
http://helios.hampshire.edu/jdavila
*******************************************************