[Clusterusers] New nodes
Wm. Josiah Erikson
wjerikson at hampshire.edu
Fri Jan 24 15:28:55 EST 2014
I'm sending this to the list, as the list may benefit from this knowledge:
Ah! I looked through the logs and figured it out. Here's what happened.
I think it's probably suitably random, but I'll leave that up to you:
-Owen (an animation student working with Perry, who I've added to this
list) spooled a render that tickled a bug in Maya that caused a simple,
routine render to chew up outrageous amounts of RAM
-Several nodes ran out of RAM
-The oom-killer (out of memory killer) killed the renders and/or your jobs
The oom-killer doesn't just kill the process that is using the most RAM
- it's more random than that, though it does prefer processes that are
using more RAM. I don't pretend to remember the exact details, but we
could look them up if it's important. However, I am pretty sure that the
oom-killer killing your jobs is probably suitably random for you to just
restart those ones.... but I'm not you, or your paper reviewers :)
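If you want to double-check which of your jobs the oom-killer took out
on a given node, something along these lines should show the kernel's
record (run it on the node itself; the exact log wording varies by
kernel version, and <pid> is just a placeholder):
    # kernel's OOM messages, including which process it picked
    dmesg | grep -i -E "out of memory|killed process"
    # live "badness" score for a running process; higher means more
    # likely to be chosen next time the node runs out of RAM
    cat /proc/<pid>/oom_score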
We found the bug and also a workaround yesterday, so those jobs are
using reasonable amounts of RAM now.
All the best,
-Josiah
On 1/24/14 3:09 PM, Thomas Helmuth wrote:
> I'm seeing it on a variety of nodes, but all are on rack 2. All of the
> killed runs I'm seeing stopped between 3 and 7 AM yesterday morning,
> though that could just be coincidence since I started them Wednesday
> afternoon.
>
> -Tom
>
>
> On Fri, Jan 24, 2014 at 2:55 PM, Wm. Josiah Erikson
> <wjerikson at hampshire.edu> wrote:
>
> That's very odd. What nodes did that happen on?
> Killed usually means something actually killed it - either it was
> pre-empted by tractor, killed by a script, or killed manually - it
> doesn't usually mean crashed.
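> For what it's worth, you can reproduce that kind of "Killed" message
> yourself in a terminal (assuming an interactive bash; the exact
> formatting may differ):
>     $ sh -c 'kill -9 $$'   # child shell kills itself with SIGKILL
>     Killed
>     $ echo $?
>     137                    # 128 + 9, i.e. terminated by SIGKILL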
> -Josiah
>
>
> On 1/24/14 2:52 PM, Thomas Helmuth wrote:
>> No problem about the 1-6 power cord.
>>
>> Any idea what killed the other runs I mentioned in my second
>> email? I might have to restart that whole set of runs to make
>> sure the stats aren't affected, and it would be nice to know that
>> they won't crash as well.
>>
>> -Tom
>>
>>
>> On Fri, Jan 24, 2014 at 10:53 AM, Thomas Helmuth
>> <thelmuth at cs.umass.edu> wrote:
>>
>> Hi Josiah,
>>
>> Probably unrelated, but it looks like yesterday morning I had
>> some runs stop before they were done, without printing any
>> errors I can find. The only thing I can find is when I select
>> the run in tractor it says something like:
>>
>> /bin/sh: line 1: 23661 Killed /share/apps/bin/lein
>> with-profiles production trampoline run clojush.examples.wc
>> :use-lexicase-selection false >
>> ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
>>
>> Most of that is just the shell command used to launch the
>> run, but do you know what that beginning part means, or have
>> any idea why these died quietly?
>>
>> Thanks,
>> Tom
>>
>>
>> On Fri, Jan 24, 2014 at 10:36 AM, Thomas Helmuth
>> <thelmuth at cs.umass.edu> wrote:
>>
>> Hi Josiah,
>>
>> Sounds good! At first I was worried this may have crashed
>> some of my runs, but it looks like they just never
>> started when they tried to run on compute-1-18 around the
>> time of the NIMBY, so it's not a big deal.
>>
>> I did just have two runs crash on compute-1-6, any idea
>> if that's related?
>>
>> We are at crunch time (paper deadline Jan. 29), and the
>> runs I have going are pretty important for that, so it
>> would be great to have minimum disruptions until then!
>>
>> -Tom
>>
>>
>> On Fri, Jan 24, 2014 at 9:13 AM, Wm. Josiah Erikson
>> <wjerikson at hampshire.edu> wrote:
>>
>> Hi guys,
>> I've got compute-1-17 and compute-1-18 NIMBYed
>> for now because I'm going to move them to make room
>> (physically) in rack 1 for 8 new nodes! (each node has
>> 2x X5560 quad-core 2.8GHz processors and 24GB of RAM)
>> Just so nobody un-NIMBYs them.
>>
>> --
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>>
>>
>>
>>
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091