<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html><body>
<p>Aha! An error that ends with "Killed" almost always means that the node ran out of RAM and the kernel's oom-killer was invoked. It usually also means the node should be rebooted. compute-4-1 in particular I just rebooted for exactly this reason: it had a whole bunch of old Java processes hanging around and almost no RAM left for anything else. They weren't yours; they were hz12's.</p>
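<p>If you want to confirm an oom-kill on a particular node, the kernel logs it. A minimal way to check (the log path assumes a RHEL/CentOS-style syslog setup):</p>
<pre>
# Look for oom-killer entries in the kernel ring buffer
dmesg | grep -iE 'oom|killed process'

# The ring buffer wraps, so also check the persisted kernel log
# (may need root to read)
grep -i 'out of memory' /var/log/messages
</pre>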
<p>-Josiah</p>
<p>On 2014-04-24 14:11, Thomas Helmuth wrote:</p>
<blockquote type="cite" style="padding-left:5px; border-left:#1010ff 2px solid; margin-left:5px"><!-- html ignored --><!-- head ignored --><!-- meta ignored -->
<div dir="ltr">
<div>I experienced another strange thing with a bunch of my runs during the same time period; I'm not sure whether it's related. On 4/22 around 10 PM I started a large batch of runs. Then, between 2 AM and 3 PM on 4/23, about 30 of my runs were killed. Normally when there's a power failure, runs end in errors and appear red in tractor. These were different -- they acted like they finished normally, except each had an error message similar to this one:</div>
<pre>/bin/sh: line 1: 15998 Killed    /share/apps/bin/lein with-profiles production trampoline run clojush.examples.wc :print-behavioral-diversity true :ultra-pads-with-empties true :use-lexicase-selection false :tournament-size 40 :print-timings true &gt; ../Results/lexicase-paper/resub/wc/tourney40/log4.txt</pre>
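<p>For reference, a job that dies from SIGKILL exits with status 137 (128 + signal 9), which fits an oom-kill rather than a clean finish. Here's a sketch of a wrapper that would flag these explicitly -- the wrapper itself is hypothetical, not something our runs currently use:</p>
<pre>
#!/bin/sh
# Hypothetical wrapper: run the real command ("$@", e.g. the lein
# invocation above) and flag SIGKILL exits instead of letting the
# run look like a normal finish.
"$@"
status=$?
if [ "$status" -eq 137 ]; then
    # 137 = 128 + 9 (SIGKILL); on a memory-starved node this
    # usually means the kernel oom-killer reaped the process.
    echo "WARNING: run was SIGKILLed (possible OOM)" >&2
fi
exit $status
</pre>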
<div>These killed runs were only on a small number of nodes: 1-1, 1-7, 1-12, 1-17, and 4-1. None of those were listed in your power failures. Is there anything that connects those nodes, either in hardware or in the power they're connected to? It seemed like a bunch of runs were assigned to these nodes; is it possible they got overloaded somehow, such as by running out of memory?</div>
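<p>In case it helps, a quick way to look at memory pressure on those nodes (free and ps are standard procps tools, so this should work on any of them):</p>
<pre>
# How much memory headroom does the node have right now?
free -m

# Which processes are holding the most RAM (RSS, in KB)?
ps aux --sort=-rss | head -15
</pre>
<p>If memory is the problem, capping each run's JVM heap (e.g. with Leiningen's :jvm-opts in project.clj) would keep a full node's worth of concurrent runs inside physical RAM.</p>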
<div>-Tom</div>
<div class="gmail_extra"><br /><br />
<div class="gmail_quote">On Thu, Apr 24, 2014 at 1:30 PM, Wm. Josiah Erikson <span><<a href="mailto:wjerikson@hampshire.edu">wjerikson@hampshire.edu</a>></span> wrote:<br />
<blockquote class="gmail_quote" style="margin: 0 0 0 .8ex; border-left: 1px #ccc solid; padding-left: 1ex;">
<div>Hi guys,<br /> This is also up on the blog, but in case you don't read it (it's at <a href="http://clusterstatus.i3ci.hampshire.edu">http://clusterstatus.i3ci.hampshire.edu</a>), here's what's up. Any thoughts or ideas are welcome!<br /><br /><br />
<div>
<p>Some weird reboots/reinstalls occurred that may or may not make any sense.</p>
<p>1d 36m ago, compute-1-18 rebooted. It’s plugged into one of the white 2200VA UPSes. No other nodes plugged into this UPS went down at that time.</p>
<p>Seventeen hours ago, the following nodes went down: 1-12, 1-14, and 2-22 through 2-25. 1-12 and 1-14 are plugged into a power strip (along with many other nodes that did not go down) that is plugged into the 3000VA UPS that is second from the bottom in the stack. That UPS has bad batteries; we stole batteries from the bottom UPS, which is off, and made a new battery pack. 2-22 through 2-26 are plugged into the top half of the 208V PDU. All of these nodes were plugged into locations shared with plenty of other nodes that didn’t go down.</p>
<p>Two hours ago, the following nodes went down: 1-5, 1-10, and 2-1 through 2-4. These nodes are all plugged into the top 3000VA UPS in the stack. However, compute-1-11 and 1-9 are also plugged into that UPS, and they did not go down at that time. They’ve also been fully loaded the whole time.</p>
<p>What gives? At this point it is entirely unclear to me.</p>
</div>
<pre>--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
<a href="tel:%28413%29%20559-6091">(413) 559-6091</a>
</pre>
</div>
</blockquote>
</div>
</div>
</div>
<br />
<pre>_______________________________________________
Clusterusers mailing list
<a href="mailto:Clusterusers@lists.hampshire.edu">Clusterusers@lists.hampshire.edu</a>
<a href="https://lists.hampshire.edu/mailman/listinfo/clusterusers">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a>
</pre>
</blockquote>
</body></html>