[Clusterusers] New nodes
Wm. Josiah Erikson
wjerikson at hampshire.edu
Fri Jan 24 15:28:55 EST 2014
I'm sending this to the list, as the list may benefit from this knowledge:
Ah! I looked through the logs and figured it out. Here's what happened.
I think it's probably suitably random, but I'll leave that up to you:
-Owen (an animation student working with Perry, who I've added to this
list) spooled a render that tickled a bug in Maya that caused a simple,
routine render to chew up outrageous amounts of RAM
-Several nodes ran out of RAM
-The oom-killer (out of memory killer) killed the renders and/or your jobs
The oom-killer doesn't just kill the process that is using the most RAM
- it's more random than that, though it does prefer processes that are
using more RAM. I don't pretend to remember the exact details, but we
could look them up if it's important. However, I am pretty sure that the
oom-killer killing your jobs is probably suitably random for you to just
restart those ones.... but I'm not you, or your paper reviewers :)
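If you want to double-check which of your jobs the oom-killer took out
on a given node, something along these lines should show the kernel's
record (run it on the node itself; the exact log wording varies by
kernel version, and <pid> is just a placeholder):
    # kernel's OOM messages, including which process it picked
    dmesg | grep -i -E "out of memory|killed process"
    # live "badness" score for a running process; higher means more
    # likely to be chosen next time the node runs out of RAM
    cat /proc/<pid>/oom_score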
We found the bug and also a workaround yesterday, so those jobs are
using reasonable amounts of RAM now.
All the best,
-Josiah
On 1/24/14 3:09 PM, Thomas Helmuth wrote:
> I'm seeing it on a variety of nodes, but all are on rack 2. All of the
> killed runs I'm seeing stopped between 3 and 7 AM yesterday morning,
> though that could just be coincidence since I started them Wednesday
> afternoon.
>
> -Tom
>
>
> On Fri, Jan 24, 2014 at 2:55 PM, Wm. Josiah Erikson
> <wjerikson at hampshire.edu> wrote:
>
> That's very odd. What nodes did that happen on?
> Killed usually means something actually killed it - either it was
> pre-empted by tractor, killed by a script, or killed manually - it
> doesn't usually mean crashed.
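> For what it's worth, you can reproduce that kind of "Killed" message
> yourself in a terminal (assuming an interactive bash; the exact
> formatting may differ):
>     $ sh -c 'kill -9 $$'   # child shell kills itself with SIGKILL
>     Killed
>     $ echo $?
>     137                    # 128 + 9, i.e. terminated by SIGKILL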
> -Josiah
>
>
> On 1/24/14 2:52 PM, Thomas Helmuth wrote:
>> No problem about the 1-6 power cord.
>>
>> Any idea what killed the other runs I mentioned in my second
>> email? I might have to restart that whole set of runs to make
>> sure the stats aren't affected, and it would be nice to know that
>> they won't crash as well.
>>
>> -Tom
>>
>>
>> On Fri, Jan 24, 2014 at 10:53 AM, Thomas Helmuth
>> <thelmuth at cs.umass.edu> wrote:
>>
>> Hi Josiah,
>>
>> Probably unrelated, but it looks like yesterday morning I had
>> some runs stop before they were done, without printing any
>> errors I can find. The only thing I can find is when I select
>> the run in tractor it says something like:
>>
>> /bin/sh: line 1: 23661 Killed /share/apps/bin/lein
>> with-profiles production trampoline run clojush.examples.wc
>> :use-lexicase-selection false >
>> ../Results/GECCO14/wc/tourney-max-points-1000-two/log1.txt
>>
>> Most of that is just the shell command used to launch the
>> run, but do you know what that beginning part means, or have
>> any idea why these died quietly?
>>
>> Thanks,
>> Tom
>>
>>
>> On Fri, Jan 24, 2014 at 10:36 AM, Thomas Helmuth
>> <thelmuth at cs.umass.edu> wrote:
>>
>> Hi Josiah,
>>
>> Sounds good! At first I was worried this may have crashed
>> some of my runs, but it looks like they just never
>> started when they tried to run on compute-1-18 around the
>> time of the NIMBY, so it's not a big deal.
>>
>> I did just have two runs crash on compute-1-6, any idea
>> if that's related?
>>
>> We are at crunch time (paper deadline Jan. 29), and the
>> runs I have going are pretty important for that, so it
>> would be great to have minimum disruptions until then!
>>
>> -Tom
>>
>>
>> On Fri, Jan 24, 2014 at 9:13 AM, Wm. Josiah Erikson
>> <wjerikson at hampshire.edu> wrote:
>>
>> Hi guys,
>> I've got compute-1-17 and compute-1-18 NIMBYed
>> for now because I'm going to move them to make room
>> (physically) in rack 1 for 8 new nodes! (each node has
>> 2x X5560 quad-core 2.8GHz processors and 24GB of RAM)
>> Just so nobody un-NIMBYs them.
>>
>> --
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>>
>>
>>
>>
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091