<div dir="ltr"><div>Yes, everything still seems to be running, even though tractor thinks the runs all crashed. So please don't cancel or delete any "crashed" runs, since they might still be running. Good that it didn't totally crash!<br><br></div>Tom<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Mar 24, 2016 at 8:25 AM, Wm. Josiah Erikson <span dir="ltr">&lt;<a href="mailto:wjerikson@hampshire.edu" target="_blank">wjerikson@hampshire.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
So, this was actually a UPS failure, yet again. I have ordered new
batteries for the UPSes in question. Interestingly, very few nodes
actually went down, but the switch that connected all the nodes did.
Most jobs should be intact and now continuing, I hope...<br>
-Josiah<div><div class="h5"><br>
<br>
<br>
<div>On 3/23/16 10:16 PM, Wm. Josiah Erikson
wrote:<br>
</div>
<blockquote type="cite">Not just that, but a lot of the classroom machines are
down too. I suspect some tractor job killed them all, but I have no
idea how or why. I will check it out in the morning. That is very
frustrating and very strange.<br>
<br>
<div class="gmail_quote">On March 23, 2016 7:52:44 PM EDT, Thomas
Helmuth <a href="mailto:trhtom@gmail.com" target="_blank">&lt;trhtom@gmail.com&gt;</a> wrote:
<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div>
<div>
<div>
<div>Hi Josiah,<br>
<br>
</div>
Something weird just happened on fly. I was sshed into
compute-4-4, and it was working fine. Suddenly, that
terminal froze, but my terminals to the head node were
fine. Now I can't ssh to any node besides the head
node -- I get the following message:<br>
<br>
[thelmuth@fly ~]$ ssh compute-2-2<br>
ssh: connect to host compute-2-2 port 22: No route to
host<br>
<br>
[thelmuth@fly ~]$ ssh compute-4-2<br>
ssh: connect to host compute-4-2 port 22: No route to
host<br>
<br>
</div>
Also, <a href="http://fly.hampshire.edu/ganglia/" target="_blank">this page</a>
reports all nodes as down. However, tractor still seems to
think all runs are running, and I'm not sure whether that's
true. Any idea what happened?<br>
<br>
</div>
Thanks,<br>
</div>
Tom<br>
</div>
</blockquote>
</div>
<br>
</blockquote>
<br>
</div></div><span class="HOEnZb"><font color="#888888"><pre cols="72">--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
<a href="tel:%28413%29%20559-6091" value="+14135596091" target="_blank">(413) 559-6091</a>
</pre>
</font></span></div>
<br>_______________________________________________<br>
Clusterusers mailing list<br>
<a href="mailto:Clusterusers@lists.hampshire.edu">Clusterusers@lists.hampshire.edu</a><br>
<a href="https://lists.hampshire.edu/mailman/listinfo/clusterusers" rel="noreferrer" target="_blank">https://lists.hampshire.edu/mailman/listinfo/clusterusers</a><br>
<br></blockquote></div><br></div>