[Clusterusers] Weird Fly Crash

Wm. Josiah Erikson wjerikson at hampshire.edu
Thu Mar 24 08:25:32 EDT 2016


So, this was actually a UPS failure, yet again. I have ordered new
batteries for the UPSes in question. Interestingly, very few nodes
actually went down, but the switch that connected all the nodes did.
Most jobs should be intact and now continuing, I hope...
    -Josiah


On 3/23/16 10:16 PM, Wm. Josiah Erikson wrote:
> Not just that, but a lot of the classroom machines are down too. I
> suspect some tractor job killed them all but I have no idea how or
> why. I will check it out in the morning. That is very frustrating and
> very strange.
>
> On March 23, 2016 7:52:44 PM EDT, Thomas Helmuth <trhtom at gmail.com>
> wrote:
>
>     Hi Josiah,
>
>     Something weird just happened on fly. I was sshed into
>     compute-4-4, and it was working fine. Suddenly, that terminal
>     froze, but my terminals to the head node were fine. Now I can't
>     ssh to any node besides the head node -- I get the following message:
>
>     [thelmuth@fly ~]$ ssh compute-2-2
>     ssh: connect to host compute-2-2 port 22: No route to host
>
>     [thelmuth@fly ~]$ ssh compute-4-2
>     ssh: connect to host compute-4-2 port 22: No route to host
>
>     Also, this page <http://fly.hampshire.edu/ganglia/> reports all
>     nodes as down. But, tractor still seems to think all runs are
>     still running -- not sure if that's true or not. Any idea what
>     happened?
>
>     Thanks,
>     Tom
>

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
