[Clusterusers] Re: now something seems really strange on fly
Wm. Josiah Erikson
wjerikson at hampshire.edu
Mon Oct 15 11:01:31 EDT 2007
I'm beginning to suspect an NFS bug - other people on the ROCKS list
have been talking about similar things. I think the processes aren't
dying because they're all waiting to write data to the NFS share and
they're in "uninterruptible iowait", which is a state in which they will
not die.
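
A minimal sketch, assuming a Linux compute node with /proc mounted, of how one might list processes stuck in this state. The kernel reports uninterruptible sleep as state "D" in the third field of /proc/<pid>/stat. The function names here are illustrative and are not part of any tool mentioned in this thread.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST closing parenthesis.
    head, _, tail = line.rpartition(")")
    pid_text, _, command = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), command, state


def uninterruptible_processes(proc_root="/proc"):
    """Return [(pid, command)] for every process currently in state D."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc; nothing to inspect
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited, or the file was unreadable
        if state == "D":
            found.append((pid, command))
    return found


if __name__ == "__main__":
    for pid, command in uninterruptible_processes():
        print(pid, command)
```

A process in state D does not act on signals, SIGKILL included, until the blocked I/O call returns, which would be consistent with jobs that survive a signal 9.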
When I restarted the NFS server, lots of things got better. I'll see if
I can upgrade the NFS server on the head node and perhaps it will stop
crashing.
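
A rough sketch, not something from this thread, of how a node could check whether an NFS-mounted path still answers before jobs write to it. It runs the stat call in a child process, so that a hung mount blocks the child rather than the caller. The function name and the default timeout are assumptions chosen for illustration.

```python
import subprocess
import sys


def mount_responds(path, timeout_seconds=5.0):
    """Return True if os.stat(path) succeeds in a child within the timeout."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means the stat call returned without an error.
        return child.wait(timeout=timeout_seconds) == 0
    except subprocess.TimeoutExpired:
        # Best-effort kill. There is deliberately no second wait() here:
        # a child stuck in uninterruptible I/O would block that call too.
        child.kill()
        return False
```

For example, `mount_responds("/home")` could be called on a compute node, assuming /home is where the head node's NFS share is mounted. A False result means either that the path is missing or that the stat call did not return in time.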
-Josiah
Lee Spector wrote:
>
> Josiah (and clusterusers, and I'm cc-ing Jon too since there may be
> breve-related issues at play),
>
> Thanks for the detective work. Please do kill all of my undead breve
> processes. Some of them DO still appear to be generating output, but
> it's nothing useful at this point and it's time to drive the wooden
> stake through their hearts :-).
>
> I think we have too little data to know this is really breve, because
> I was also running breve a lot when this WASN'T happening. But it
> could conceivably be something in a recently installed version of
> breve... and now there's a new official release, and I don't know if
> that has been installed so maybe that would help (Jon -- can you make
> sure that the version on fly is the latest?).
>
> Since there was an NFS server crash on the head node and since Kyle
> has been experimenting with some new networking stuff (also in breve)
> I think that may be a useful place to look...
>
> Thanks,
>
> -Lee
>
>
> On Oct 15, 2007, at 10:43 AM, Wm. Josiah Erikson wrote:
>
>> Lee (and others),
>> I'm not sure yet exactly why, but here's the situation:
>>
>> -the "killed" breve jobs aren't showing up in cluster-top
>> -breve is still running on the nodes, taking up memory and putting
>> the load average up, but not using any CPU (according to top when you
>> SSH into the node)
>> -the home directory missing issue is because the NFS server on the
>> head node has crashed. I restart it and everything is fine, but why
>> did it crash? The logs show nothing.
>>
>> -all of this seems to only happen when we run breve. But why? I'm
>> not convinced that my ulimit stuff worked... in fact, I'm pretty sure
>> it didn't.
>>
>>
>> I can't kill the breve jobs - they won't die, not even with a
>> signal 9. I think I'll have to restart the nodes to get them to
>> behave again.
>>
>> Obviously, I've got my work cut out for me here :) I'll report
>> anything I figure out. I'm assuming for now that any breve jobs on
>> the cluster are junk, yes? I don't see any that are actually still
>> using CPU, meaning, I think, that you killed them but they didn't die.
>>
>>
>> -Josiah
>>
>>
>>
>> Lee Spector wrote:
>>>
>>> Josiah and clusterusers,
>>>
>>> I tried earlier today to kill all of my cluster jobs that had been
>>> running for the previous day or so. I thought it had worked, but
>>> just noticed that I had got a timeout after the first node. Tried
>>> again -- and my kill script hangs on ALL nodes, but continues to the
>>> next one if I command-C (over and over again to get to each node).
>>> Never seen that before. And when I got up to compute-1-13,
>>> compute-1-15, and compute-1-16, which might not even have had my
>>> jobs launched on them (because of a previous problem -- I think I
>>> wrote about this to Josiah but not to the cluster list), they asked
>>> me for a password. I've had to deal with node/password issues
>>> before, but I don't see any reason why this should have cropped up now.
>>>
>>> So something seems to be pretty hosed at the moment. All of my breve
>>> processes do seem to have been killed, assuming cluster-top is
>>> telling the truth, but something's wrong.
>>>
>>> All of which (along with the other recent problems) is really
>>> puzzling me because I've been running essentially this same code for
>>> months -- it's just breve/PushGP on a straightforward problem, using
>>> the same shell scripts that I've been using for months -- and I've
>>> never had problems anything like this before the last couple of
>>> weeks. What's changed? Is it the simultaneous use of the cluster for
>>> lots of rendering? But hasn't that been about the same for a while
>>> too? Could it be Kyle's new work with breve -- Kyle, is it possible
>>> that your block-transport code is consuming a lot of OS network
>>> resources of some kind? Any other ideas?
>>>
>>> -Lee
>>>
>>> --
>>> Lee Spector, Professor of Computer Science
>>> School of Cognitive Science, Hampshire College
>>> 893 West Street, Amherst, MA 01002-3359
>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>
>>
>> --
>> Wm. Josiah Erikson
>> Computing Support
>> School of Cognitive Science
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> http://lists.hampshire.edu/mailman/listinfo/clusterusers
>
--
Wm. Josiah Erikson
Computing Support
School of Cognitive Science
Hampshire College
Amherst, MA 01002
(413) 559-6091