[Clusterusers] now something seems really strange on fly

Sun Oct 14 23:26:26 EDT 2007

Josiah and clusterusers,

I tried earlier today to kill all of my cluster jobs that had been  
running for the previous day or so. I thought it had worked, but I just
noticed that I had gotten a timeout after the first node. Tried again --
and my kill script hangs on ALL nodes, but continues to the next one  
if I command-C (over and over again to get to each node). Never seen  
that before. And when I got up to compute-1-13, compute-1-15, and  
compute-1-16, which might not even have had my jobs launched on them  
(because of a previous problem -- I think I wrote about this to  
Josiah but not to the cluster list), they asked me for a password.  
I've had to deal with node/password issues before, but I don't see  
any reason why this should have cropped up now.

So something seems to be pretty hosed at the moment. All of my breve  
processes do seem to have been killed, assuming cluster-top is  
telling the truth, but something's wrong.

All of which (along with the other recent problems) is really  
puzzling me because I've been running essentially this same code for  
months -- it's just breve/PushGP on a straightforward problem, using  
the same shell scripts that I've been using for months -- and I've  
never had problems like this before the last couple of
weeks. What's changed? Is it the simultaneous use of the cluster for  
lots of rendering? But hasn't that been about the same for a while  
too? Could it be Kyle's new work with breve -- Kyle, is it possible  
that your block-transport code is consuming a lot of OS network  
resources of some kind? Any other ideas?

  -Lee

--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438