[Clusterusers] How bad would power outages tomorrow be?

Lee Spector lspector at hampshire.edu
Mon May 1 10:28:54 EDT 2017


It might... but I don't remember how to tell which are which. Can you remind me?

For the suites of runs, if any of them goes down then it compromises the value of the data, and we'd probably have to re-run the whole suite. (The reason, briefly: Suppose you have a setup that finds solutions quickly on half of its runs, and on the other half takes a long time to fail, never finding a solution and eventually just hitting the generation limit. A power outage might kill all and only the ones that wouldn't ever have succeeded. If we start the killed runs afresh later, then probably half of those restarted runs will succeed. So it'll look like a 75% success rate (the original 50% plus half of the restarted 50%) when really the success rate is 50%.)
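
The arithmetic in that parenthetical can be checked with a short sketch. This is only an illustration, not code from the cluster or from anyone's experiments: the function name apparent_success_rate and its parameters are hypothetical, and it assumes that the outage kills only runs that would have failed and that each restarted run succeeds with the same true rate.

```python
from fractions import Fraction


def apparent_success_rate(true_rate, killed_fraction_of_failures=Fraction(1)):
    """Success rate a suite appears to have after an outage and restarts.

    true_rate: real probability that a single run finds a solution.
    killed_fraction_of_failures: share of the doomed (would-have-failed)
        runs that the outage kills before they hit the generation limit.
    Assumption (hypothetical model): quick successes finish before the
    outage, and every killed run is restarted once from scratch.
    """
    succeeded_first_time = true_rate
    doomed = 1 - true_rate
    restarted = doomed * killed_fraction_of_failures
    # Each restarted run is a fresh draw, so it succeeds at the true rate.
    succeeded_on_restart = restarted * true_rate
    return succeeded_first_time + succeeded_on_restart


# The case described above: 50% true rate, outage kills every doomed run.
print(apparent_success_rate(Fraction(1, 2)))  # 3/4
```

With no outage (killed_fraction_of_failures = 0) the same function returns the true rate, which is why the whole suite has to be re-run to get unbiased numbers.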



> On May 1, 2017, at 10:24 AM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
> 
> Remember it should only kill runs that aren't on "UPS" nodes. Does that
> change anything, or were you already considering that?
> 
>    -Josiah
> 
> 
> 
> On 5/1/17 10:18 AM, Lee Spector wrote:
>> Well.... it wouldn't be good. Not quite as awful as an interruption over this last weekend or today, since we have a bunch of publication deadlines tonight (actually 7:59am tomorrow) and some of those runs might produce results that will lead to last-minute changes. But it will invalidate some experiments involving suites of runs that have already consumed several days of CPU time on many nodes, and kill a few runs that have been going for a couple of weeks, for which I'm still curious about the outcome.
>> 
>> Tom also has some experiments involving large suites of runs that may have to be re-started if this happens.
>> 
>> That said, solar=good! I don't know how to make the call!
>> 
>> -Lee
>> 
>> 
>> 
>>> On May 1, 2017, at 9:59 AM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>> 
>>> Hi all,
>>> 
>>>   There are a lot of runs going on the cluster right now. In light of
>>> that, I just got an announcement that there will be testing going on
>>> tomorrow morning to get our solar arrays online, which may cause some
>>> power disruptions, which may take some of our nodes offline. How bad is
>>> this, and should I get my boss to push back to the President on this?
>>> Lee in particular...
>>> 
>>> 
>>> -- 
>>> Wm. Josiah Erikson
>>> Assistant Director of IT, Infrastructure Group
>>> System Administrator, School of CS
>>> Hampshire College
>>> Amherst, MA 01002
>>> (413) 559-6091
>>> 
>>> _______________________________________________
>>> Clusterusers mailing list
>>> Clusterusers at lists.hampshire.edu
>>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>> --
>> Lee Spector, Professor of Computer Science
>> Director, Institute for Computational Intelligence
>> Hampshire College, Amherst, Massachusetts, USA
>> lspector at hampshire.edu, http://hampshire.edu/lspector/, 413-559-5352
>> 
> 
> -- 
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
> 

--
Lee Spector, Professor of Computer Science
Director, Institute for Computational Intelligence
Hampshire College, Amherst, Massachusetts, USA
lspector at hampshire.edu, http://hampshire.edu/lspector/, 413-559-5352


