[Clusterusers] ...and another

Wm. Josiah Erikson wjerikson at hampshire.edu
Fri Jul 29 10:10:27 EDT 2016


OK, we now have a "ups" tag. I hope it is pretty accurate. It should have
no false positives, but some false negatives. Those are in rack4, though,
which I think you UPS-caring people don't use anyway, right?

So just append the "ups" tag to whatever tag you were already using,
separated by a comma, if I remember correctly. For example: "tom,ups"
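
For anyone scripting their submissions, here is a minimal sketch of building
that tag string in Python. It assumes the comma-separated form above is right
(which, as noted, is from memory). The helper name with_ups is hypothetical,
and how the string actually gets passed to the scheduler is not shown.

```python
def with_ups(tags: str) -> str:
    """Append "ups" to a comma-separated tag string unless it is already there."""
    # Split on commas and drop empty pieces and stray whitespace.
    parts = [t.strip() for t in tags.split(",") if t.strip()]
    if "ups" not in parts:
        parts.append("ups")
    return ",".join(parts)


print(with_ups("tom"))  # tom,ups
```

Calling it on a string that already includes "ups" leaves the string unchanged,
so it is safe to apply more than once.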

Let me know if anything doesn't look right.

    -Josiah



On 7/28/16 1:03 PM, Lee Spector wrote:
> I like the "ups" tag as well. Our runs are long (days or weeks), and losing them can sometimes be even worse than just losing the ones that actually die, because even if we re-run those, the stats for the entire experiment are messed up.
>
> (Short story: even if the outage had nothing to do with the run itself, the fact that longer-running runs are more likely to get hit by random outages means that longer-running runs are more likely to be restarted, which, if there is a correlation between longer-runningness and eventual failure -- which there probably is -- means that restarting those runs will bias averages toward success.) 
>
> So if we can target things only toward nodes that are protected then that would be helpful.
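
The bias argument above can be illustrated with a toy simulation. This is only
a sketch under made-up assumptions, not a model of any real experiment: the
true success rate is 50%, successful runs finish in 1 to 5 days, failing runs
go the full 10 days, outages arrive at a constant rate per day, and a restarted
run is a fresh random draw. All names and numbers are hypothetical.

```python
import math
import random


def draw_run(rng, p_success=0.5):
    """Hypothetical run model: successes stop early, failures run to the limit."""
    success = rng.random() < p_success
    duration_days = rng.uniform(1.0, 5.0) if success else 10.0
    return success, duration_days


def observed_success_rate(n_runs, outage_rate_per_day, rng):
    """Restart any run hit by an outage; return the success rate of finished runs."""
    successes = 0
    for _ in range(n_runs):
        while True:
            success, duration_days = draw_run(rng)
            # Chance of seeing no outage during a run of this length.
            p_survive = math.exp(-outage_rate_per_day * duration_days)
            if rng.random() < p_survive:
                break  # the run finished; otherwise it is restarted from scratch
        successes += success
    return successes / n_runs


rng = random.Random(0)
protected = observed_success_rate(20000, 0.0, rng)    # no outages (UPS nodes)
unprotected = observed_success_rate(20000, 0.1, rng)  # outages, with restarts
print(f"protected:   {protected:.3f}")
print(f"unprotected: {unprotected:.3f}")
```

With no outages the observed rate stays near the true 50%. With outages and
restarts it comes out noticeably higher, because the long failing runs are the
ones most often thrown away and redrawn.
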
>
>  -Lee
>
>
>> On Jul 28, 2016, at 10:49 AM, Bassam Kurdali <bassam at urchn.org> wrote:
>>
>>  That makes sense! My renders are relatively short (duration), so I'd
>> be fine with the occasional outage.
>>  
>> On Thu, 2016-07-28 at 09:43 -0400, Wm. Josiah Erikson wrote:
>>> ...or I could add another service tag, "ups", and runs that need to be
>>> able to run for long periods of time could be submitted with that tag. I
>>> think that's a better solution. I'll do that, and let you all know when
>>> it's done.
>>>
>>>     -Josiah
>>>
>>>
>>>
>>> On 7/28/16 9:42 AM, Wm. Josiah Erikson wrote:
>>>> I hear from the power company that we can expect regular brownouts until
>>>> this heat wave passes... should I just leave the cluster as-is in that
>>>> case (the ones that are UPS protected) and just turn the rest off so
>>>> that runs can actually finish?
>>>>
>>>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
> --
> Lee Spector, Professor of Computer Science
> Director, Institute for Computational Intelligence
> Hampshire College, Amherst, Massachusetts, USA
> lspector at hampshire.edu, http://hampshire.edu/lspector/, 413-559-5352

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091


