[Clusterusers] A few nodes rebooted

Lee Spector lspector at hampshire.edu
Thu Oct 12 11:17:48 EDT 2017


Thanks -- what you did was great, but FWIW for these runs, which were exploratory, it wouldn't have mattered much either way. It gets messy when we're collecting statistics over large numbers of runs to get solid evidence that something works better than something else, etc. Then we have to worry about whether the death/restarting of runs introduces bias, which it often can... but not here, so all is well!

Thanks!

 -Lee



> On Oct 12, 2017, at 9:28 AM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
> 
> OK, we should be safe to restart jobs and not have them crash. Lee, I
> restarted all error tasks on the jobs that I knew were due to this crash
> and then realized maybe you wouldn't have wanted me to do that - sorry
> if that's the case.
> 
> I noticed you hadn't done "retry all error tasks" on the ones that died
> because compute-0-2's hard drive died, which made me think maybe I had
> made a mistake in restarting the other ones. If not, you can just
> use "retry all error tasks" from the right-click menu on those jobs.
> 
>     -Josiah
> 
> 
> On 10/12/17 7:41 AM, Wm. Josiah Erikson wrote:
>> Hi all,
>> 
>>     In the past 24 hours, compute-1-3 (yesterday), and then last night,
>> compute-2-9 through compute-2-12 and compute-2-30 all rebooted
>> themselves and reinstalled due to overheating and/or overwhelming the
>> UPSes they are plugged into.
>> 
>>     The A/C in the room can't keep up and neither can the UPSes, so I
>> think I'll try at least temporarily retiring the oldest and least
>> performance/watt-efficient nodes, which I think at this point is rack 4,
>> and see if that solves both problems, since we aren't using them for
>> much anyway.
>> 
> 
> -- 
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
> pronouns: he/him/his
> 
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers

--
Lee Spector, Professor of Computer Science
Director, Institute for Computational Intelligence
Hampshire College, Amherst, Massachusetts, 01002, USA
lspector at hampshire.edu, http://hampshire.edu/lspector/, 413-559-5352
