[Clusterusers] hex issues

Lee Spector lspector at hampshire.edu
Wed Dec 15 11:58:47 EST 2004


--- old stuff first ---

This sounds right, and the test does what you predicted -- it got from 
n23 down to n04.

So I guess there's a TCP/IP limit that should be jacked up.
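
(One quick way to sanity-check that theory would be to count open TCP 
connections on the master while the launch script is running -- 
something like:

     netstat -tn | grep -c ESTABLISHED    # sessions currently open
     netstat -tn | grep -c TIME_WAIT      # recently closed, still held

though I'm only guessing that that's where the ceiling is. rsh also has 
to grab its source ports from the small reserved range below 1024, 
which is one place a hard limit like this could come from -- again, 
just a guess.)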

FYI the communication for my systems is all master<->node, with none 
directly node<->node. But maybe for others we should ensure that heavy 
node<->node traffic is also permitted?

---- AH, just got your new mail ----

The fix sounds good, but I've just started a run since getting your 
mail and it's still choking... Does something else need to be 
restarted/rebooted?
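
(If the limits in question live in xinetd's configuration -- an 
assumption on my part, since rsh is usually run out of xinetd -- then 
presumably xinetd has to re-read its config before the new values take 
effect, e.g.

     /etc/init.d/xinetd restart     # or: service xinetd restart

on whichever machine(s) were changed.)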

Also, is there a reason not to jack things up higher (you did 
"increased the max instances parameter up to 80 from 60 and I increased 
the 'connections per second' parameter up to 30 from 25."). If I'm 
hitting the limits now, mightn't we hit them a lot more when there's 
more usage in the future? Or other kinds of applications? Is there any 
downside to setting max instances to 600 or whatever (and similarly for 
connections per second)? I wonder if the limits aren't set as they are 
because that makes more sense for server farm applications and the 
like, where the network requests come from the outside? But when we do 
a lot of networking it's because our applications demand it, and we 
want the messages to go through regardless of the delays from 
bottlenecks... Just a thought.
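
(For reference, my guess at where those settings live, if they are in 
fact xinetd's -- "instances" and "cps" are real xinetd attributes, but 
their exact location and the second cps value here are assumptions:

     # in /etc/xinetd.conf defaults or /etc/xinetd.d/rsh
     instances = 80     # max simultaneous server processes (was 60)
     cps       = 30 10  # connections/second allowed, then seconds to
                        # back off once the rate is exceeded (was 25)

so cranking them much higher would just be a matter of editing those 
two lines.)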

Thanks,

  -Lee



On Dec 15, 2004, at 11:14 AM, Ryan Moore wrote:

> Well, this error seems to be in reference to a failure by the TCP/IP 
> protocol stack to make a new socket connection. It's possible that 
> you never saw this message before because with a smaller number of 
> nodes you were opening fewer TCP sessions. Try running the script 
> backwards through the nodes (starting with n23 down). I predict that 
> the error will crop up between 1 and 8 somewhere. Or I'm completely 
> wrong and I'll find a new theory. I'll keep looking for a way to tweak 
> the TCP/IP stack to fix this. It could mean the need for a new kernel 
> on the master node.
>
> node 2 should be back to perfect.
>
> -ryan
>
> Lee Spector wrote:
>
>>
>> Thanks Jaime. That seems to be the same issue I'm experiencing, 
>> except that I never had the problem on fly OR on hex prior to the 
>> recent node additions. I haven't been doing enough runs on the new 
>> nodes to say if the problem cropped up immediately when they were 
>> installed or sometime thereafter. But I know that it never happened 
>> in thousands of rsh calls prior to the new nodes. And it's now 
>> happening for me only on the new nodes.
>>
>> One other thing is that the problem does, in my case, interrupt my 
>> script... either that or it always happens simultaneously on a run of 
>> nodes from n??-n23, which seems unlikely.
>>
>> Also -- the nodes that are failing to accept rsh are sometimes doing 
>> so very persistently. I've had to resort to manually 
>> rshing to each such node (vanilla "rsh" works to get an interactive 
>> shell, but rsh with a command argument does not) to kill processes.
>>
>> This seems like something that we should fix rather than work around 
>> (though the loops are clever!). Ryan: could you do a little digging 
>> on that error message ("protocol failure in circuit setup") and see 
>> if there's a known fix?
>>
>> Thanks,
>>
>>  -Lee
>>
>>
>> On Dec 15, 2004, at 8:54 AM, Jaime Davila wrote:
>>
>>> I've noticed that rsh commands don't always go through. For me, at 
>>> least, this is nothing new. I've recently moved to checking the 
>>> return status of the command, and looping if it indicates a problem. 
>>> For some reason, the same command to machine X at time T fails, 
>>> while at time T+d it succeeds, even when d is very small (if I just 
>>> loop as fast as the calling program can, it fails 5-10 times, and 
>>> then succeeds). This leads me to believe there are issues, outside 
>>> my control, that cause the command to not go through. I've placed a 
>>> wait period of 5 seconds on those loops, and I've yet to see a 
>>> command fail twice in a row.
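>>>
>>> (A sketch of the retry idea, with hypothetical names; note that rsh 
>>> exits with the status of the connection attempt rather than that of 
>>> the remote command, which is exactly what matters here.)
>>>
>>>     until rsh $node "$remote_command"; do
>>>         echo "rsh to $node failed; retrying in 5 seconds" >&2
>>>         sleep 5
>>>     done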
>>>
>>> One thing to note is that the failed commands don't halt the rest of the 
>>> scripts or programs, nor do the scripts crash. The rsh part of it 
>>> simply fails.
>>>
>>> BTW, I've seen this type of thing when explicitly running rsh 
>>> commands, or with other commands that use the same network 
>>> facilities, like remote procedure calls.
>>>
>>> I've also seen the server side of my programs crash (or at least 
>>> disappear from the list of running processes), but this has happened 
>>> to me on many machines other than hex. I don't know enough about 
>>> remote procedure call protocols to debug the issue and pinpoint the 
>>> problem (the code I use is automatically generated by rpcgen).
>>>
>>> Lee Spector wrote:
>>>
>>>> Fellow cluster users,
>>>> I've recently resumed doing some runs on hex and have been having 
>>>> some puzzling problems.
>>>>
>>>> The most puzzling is that my script to start lisp runs on all of the 
>>>> nodes seems most of the time to get interrupted at some random point, 
>>>> meaning that it starts my processes on nodes n01-n??, where n?? is 
>>>> occasionally 23 but sometimes 19, 15, or something else. One 
>>>> possibility here is that the processes really are being started but 
>>>> that they abort before they can do anything -- I thought this might 
>>>> happen if the nodes were heavily loaded, but when I poked around with 
>>>> w and top I didn't see much happening on the failing nodes... And 
>>>> besides it seems unlikely that somebody's running short, intense 
>>>> processes on the highest-numbered nodes, always in sequence from n23 
>>>> down...
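>>>>
>>>> (For concreteness, a minimal sketch of the kind of per-node launch 
>>>> loop described here -- the node range and command name are 
>>>> hypothetical:)
>>>>
>>>>     for i in $(seq -w 1 23); do              # nodes n01 through n23
>>>>         rsh -n n$i "$start_lisp_command" &   # hypothetical launch command
>>>>     done
>>>>     wait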
>>>>
>>>> In a possibly related problem I've been getting the following message 
>>>> printed to my shell on the master node from time to time:
>>>>
>>>>      protocol failure in circuit setup
>>>>
>>>> I've googled this but only learned that it's an rsh error, which makes 
>>>> sense in relation to the previous problem.
>>>>
>>>> Another problem is that n02 is unresponsive, and seemingly in a way 
>>>> that's somehow worse than a normal crashed node. Normally a crashed 
>>>> node just causes a hiccup and an "unreachable node" (or similar) error 
>>>> message from my scripts, but now some of those scripts are hanging 
>>>> instead. Incidentally, some of the scripts I use to do diagnostics 
>>>> (and kill my processes!) are hanging this way, making it hard to mess 
>>>> around much.
>>>>
>>>> Any leads or diagnostic suggestions?
>>>>
>>>> Thanks,
>>>>  -Lee
>>>
>>>
>>> -- 
>>>
>>> ******************************************************
>>> Jaime J. Davila
>>> Assistant Professor of Computer Science
>>> Hampshire College
>>> School of Cognitive Science
>>> jdavila at hampshire dot edu
>>> http://helios.hampshire.edu/jdavila
>>> *******************************************************
>>>
--
Lee Spector
Dean, Cognitive Science + Professor, Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438




