[Clusterusers] Mysterious node reboots

Thomas Helmuth thelmuth at cs.umass.edu
Thu Apr 24 14:11:31 EDT 2014


I experienced another strange thing with a bunch of my runs during the same
time period. I'm not sure if it's related or not. On 4/22 around 10PM I
started a large batch of runs. Then, between 2AM and 3PM on 4/23 I had
about 30 runs that were killed. Normally when there's a power failure, runs
end in errors and appear red in tractor. These were different -- they just
acted like they finished normally, except each had an error message similar
to this one:

/bin/sh: line 1: 15998 Killed /share/apps/bin/lein with-profiles production
trampoline run clojush.examples.wc :print-behavioral-diversity true
:ultra-pads-with-empties true :use-lexicase-selection false
:tournament-size 40 :print-timings true >
../Results/lexicase-paper/resub/wc/tourney40/log4.txt

These killed runs were only on a small number of nodes: 1-1, 1-7, 1-12,
1-17, and 4-1. Apart from 1-12, none of those were listed in your power
failures. Is there any relation among those nodes, either in hardware or in
the power they're connected to? It seemed like a bunch of runs were assigned
to these nodes. Is it possible they got overloaded somehow, for example by
running out of memory?
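
If memory is the suspect, the kernel log on those nodes should show it: when
Linux kills a process for running out of memory, it logs a line of the form
"Killed process <pid> (<name>) ...". Here is a minimal sketch for pulling those
lines out of dmesg output. The function name is made up for illustration, and
the exact message wording varies between kernel versions, so treat the pattern
as an assumption rather than a guarantee.

```python
import re

# Assumed format of the kernel's OOM-kill report line, e.g.
#   "Killed process 15998 (java) total-vm:..."
# Wording differs across kernel versions; adjust the pattern if needed.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a list of (pid, command) pairs, one per OOM-kill line found."""
    kills = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

Feeding it the lines from dmesg on one of the affected nodes (or from the
saved kernel log, if dmesg has already rotated) would show whether the PIDs in
the "Killed" messages, such as 15998 above, were OOM-killer victims.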

-Tom


On Thu, Apr 24, 2014 at 1:30 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:

>  Hi guys,
>     This is also up on the blog, but in case you don't read it (it's at
> http://clusterstatus.i3ci.hampshire.edu), here's what's up. Any thoughts
> or ideas are welcome!
>
>
>  Some weird reboots/reinstalls occurred that may or may not make any
> sense.
>
> 1d 36m ago, compute-1-18 rebooted. It's plugged into one of the white
> 2200VA UPSes. No other nodes plugged into this UPS went down at that time.
>
> Seventeen hours ago, the following nodes went down: 1-12, 1-14, and 2-22
> through 2-25. 1-12 and 1-14 are plugged into a power strip (along with many
> other nodes that did not go down) that is plugged into the 3000VA UPS that
> is second from the bottom in the stack. It has bad batteries. We stole from
> the bottom UPS, which is off, and made a new battery pack. 2-22 through
> 2-26 are plugged into the top half of the 208V PDU. All of these nodes were
> plugged into locations that had plenty of other nodes also plugged in that
> didn't go down.
>
> 2 hours ago, the following nodes went down: 1-5, 1-10, 2-1 through 2-4.
> These nodes are all plugged into the top 3000VA UPS in the stack. However,
> compute-1-11 and 1-9 are also plugged into that UPS, and they did not go
> down at that time. They're also fully loaded and have been the whole time.
>
> What gives? At this point it is entirely unclear to me.
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091