[Clusterusers] A few nodes rebooted

Wm. Josiah Erikson wjerikson at hampshire.edu
Thu Oct 12 09:59:52 EDT 2017


I've upgraded the CPU cooler, with fresh thermal paste, on compute-1-3.
Hopefully it won't crash again. If it does, I'll replace its internals
with something lower power.

    -Josiah



On 10/12/17 9:28 AM, Wm. Josiah Erikson wrote:
> OK, we should be safe to restart jobs and not have them crash. Lee, I
> restarted all error tasks on the jobs that I knew were due to this crash
> and then realized maybe you wouldn't have wanted me to do that - sorry
> if that's the case.
>
> I noticed you hadn't used "retry all error tasks" on the jobs that
> failed when compute-0-2's hard drive died, which made me think maybe I
> had made a mistake in restarting the other ones. If not, you can just
> right-click those jobs and choose "retry all error tasks".
>
>     -Josiah
>
>
> On 10/12/17 7:41 AM, Wm. Josiah Erikson wrote:
>> Hi all,
>>
>>     In the past 24 hours, compute-1-3 (yesterday), and then last night,
>> compute-2-9 through compute-2-12 and compute-2-30 all rebooted
>> themselves and reinstalled due to overheating and/or overwhelming the
>> UPSes they are plugged into.
>>
>>     The A/C in the room can't keep up and neither can the UPSes, so I
>> think I'll try retiring, at least temporarily, the oldest and least
>> performance/watt-efficient nodes, which at this point I think is
>> rack 4, and see if that solves both problems, since we aren't using
>> them for much anyway.
>>

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
pronouns: he/him/his


