[Clusterusers] Weird Fly Crash

Thomas Helmuth thelmuth at cs.umass.edu
Thu Mar 24 13:18:07 EDT 2016


Yes, all the runs seem to still be running even though tractor thinks they
all crashed. So please don't cancel or delete any "crashed" runs, since they
might still be running. Good that it didn't totally crash!

Tom

On Thu, Mar 24, 2016 at 8:25 AM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:

> So, this was actually a UPS failure, yet again. I have ordered new
> batteries for the UPSes in question. Interestingly, very few nodes actually
> went down, but the switch that connected all the nodes did. Most jobs
> should be intact and now continuing, I hope...
>     -Josiah
>
>
>
> On 3/23/16 10:16 PM, Wm. Josiah Erikson wrote:
>
> Not just that, but a lot of the classroom machines are down too. I suspect
> some tractor job killed them all, but I have no idea how or why. I will
> check it out in the morning. That is very frustrating and very strange.
>
> On March 23, 2016 7:52:44 PM EDT, Thomas Helmuth <trhtom at gmail.com> wrote:
>>
>> Hi Josiah,
>>
>> Something weird just happened on fly. I was sshed into compute-4-4, and
>> it was working fine. Suddenly, that terminal froze, but my terminals to the
>> head node were fine. Now I can't ssh to any node besides the head node -- I
>> get the following message:
>>
>> [thelmuth at fly ~]$ ssh compute-2-2
>> ssh: connect to host compute-2-2 port 22: No route to host
>>
>> [thelmuth at fly ~]$ ssh compute-4-2
>> ssh: connect to host compute-4-2 port 22: No route to host
>>
>> Also, this page <http://fly.hampshire.edu/ganglia/> reports all nodes as
>> down. But tractor still seems to think all runs are still running; I'm not
>> sure whether that's true. Any idea what happened?
>>
>> Thanks,
>> Tom
>>
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>
>