[Clusterusers] A few nodes rebooted

Thu Oct 12 09:28:20 EDT 2017

OK, it should now be safe to restart jobs without them crashing. Lee, I
restarted all error tasks on the jobs that I knew had failed because of
this crash, and then realized maybe you wouldn't have wanted me to do
that - sorry if that's the case.

I noticed you hadn't retried the error tasks on the jobs that died
because compute-0-2's hard drive failed, which made me think I might
have made a mistake in restarting the other ones. If not, you can just
right-click those jobs and choose "retry all error tasks".

    -Josiah

On 10/12/17 7:41 AM, Wm. Josiah Erikson wrote:
> Hi all,
>
>     In the past 24 hours, compute-1-3 (yesterday), and then last night,
> compute-2-9 through compute-2-12 and compute-2-30 all rebooted
> themselves and reinstalled due to overheating and/or overwhelming the
> UPSes they are plugged into.
>
>     The A/C in the room can't keep up and neither can the UPSes, so I
> think I'll try at least temporarily retiring the oldest and least
> performance/watt-efficient nodes, which I think at this point is rack 4,
> and see if that solves both problems, since we aren't using them for
> much anyway.
>

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
pronouns: he/him/his