[Clusterusers] NIMBYed nodes
Wm. Josiah Erikson
wjerikson at hampshire.edu
Thu Apr 23 09:27:46 EDT 2015
I added him to the list.
-Josiah
On 4/22/15 9:20 PM, Piper Odegard wrote:
>
> Huh, other frames in the scene had been rendering fine on the rest of
> rack2—something was funky. Whatever the case is, we're forging ahead.
>
> Also, I just NIMBY'd asheclass-08 because someone is rendering their
> anim 2 final on it locally.
>
> Also also, I've cc'd Jordan Miron—he's started rendering his final
> project on the farm, so being in the loop wouldn't hurt.
>
> - Piper
>
>
>
>
>
> On 2015-04-22 17:32, Wm. Josiah Erikson wrote:
>
>> Nope, don't think so. It errored because it ran out of RAM. Keep in
>> mind it has 24GB of that, so something is probably up with your scene
>> still. Maybe if you got lucky and went to one of the really really
>> big nodes that frame made it through :) I just rebooted it and I will
>> un-NIMBY it.
>> -Josiah
>>
>>
>>
>> On 4/22/15 3:16 PM, Piper Odegard wrote:
>>>
>>> Hey Josiah,
>>>
>>> I just NIMBY'd compute 2-1 because it errored out 3 tasks in a row
>>> on one of my jobs—in the future, is there a way on the fly to tell a
>>> job not to use a particular
>>> machine for the rest of its run?
>>>
>>> - Piper
>>>
>>>
>>>
>>>
>>>
>>> On 2015-04-20 18:59, Thomas Helmuth wrote:
>>>
>>> Awesome, thanks Chris! I should be able to merge that idea into
>>> my current launching scripts to get Lee going, but we'll page
>>> you if we run into trouble.
>>>
>>> Tom
>>>
>>> On Mon, Apr 20, 2015 at 6:57 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>>
>>> Thanks Chris!
>>>
>>> -Lee
>>>
>>>
>>> On Apr 20, 2015, at 6:47 PM, Chris Perry <perry at hampshire.edu> wrote:
>>>
>>>
>>> My forum post resulted in a solution that could work for
>>> us if we keep the slots as they are. The RemoteCmd line
>>> to use is:
>>>
>>> RemoteCmd { command } -samehost 1 -atmost 4 -atleast 4
>>> -service { blah }
>>>
>>> That will check out exactly 4 slots on a single
>>> node. So all Lee needs to know is how many slots
>>> are on the machine he wants, and we should be good
>>> to go using a RemoteCmd structure like this.
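For reference, a fuller sketch of how that line might sit inside a job script. The job title, command path, and `render` service key are placeholders, not our actual configuration; the exact surrounding Job/Task syntax should be checked against the tractor scripting docs on fly.

```tcl
# Hypothetical tractor job sketch. Only the RemoteCmd flags come from
# the forum answer; the title, command path, and service key are made up.
Job -title {lee-multithreaded-run} -priority 100 -subtasks {
    Task -title {run} -cmds {
        # Check out exactly 4 slots, all on the same node.
        RemoteCmd {/path/to/myrun} -samehost 1 -atmost 4 -atleast 4 -service {render}
    }
}
```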
>>>
>>> I’m happy to help you guys make these scripts work if
>>> and when the time comes.
>>>
>>> - chris
>>>
>>> On Apr 20, 2015, at 5:28 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>>
>>>
>>> This, in conjunction with our observations of weird
>>> multithreading performance pathologies on the JVM,
>>> makes me think that we SHOULDN'T change the slots
>>> per node from whatever we're currently doing, at
>>> least not universally or by default. Is there a way
>>> to make this work as it currently does most of the
>>> time, but for me to grab a machine in 1-slot mode
>>> when I want one?
>>>
>>> My use case is the exception, not the rule, and I
>>> usually need one or a small handful of these
>>> sessions going at a time (I run them individually
>>> and watch them). At the moment I'm just using my
>>> desktop mac, and I think that actually runs my code
>>> faster than any fly node at the moment (although 1-4
>>> seemed to perform well). I may want to run on a fly
>>> node or two at some point too, but I wouldn't want
>>> the performance of other work (especially Tom's many
>>> many large runs) to take a hit for this.
>>>
>>> -Lee
>>>
>>>
>>>
>>> On Apr 20, 2015, at 4:59 PM, Thomas Helmuth <thelmuth at cs.umass.edu> wrote:
>>>
>>> I agree with the others that it should be pretty
>>> easy to do everything you need. And I can walk
>>> you through how to easily do it in Clojush with
>>> the python script I have to launch runs.
>>>
>>> I'm not sure if I'd get more or less performance
>>> with one slot per node, but I'm guessing less --
>>> I often get multiple runs going on a node, so
>>> I'd need a 4x or 8x speedup from multithreading
>>> in a single run to get the same amount of work
>>> done. But maybe this would just happen
>>> automatically with how we have it set up in Clojush.
>>>
>>> Tom
>>>
>>> On Mon, Apr 20, 2015 at 4:19 PM, Chris Perry <perry at hampshire.edu> wrote:
>>>
>>>
>>> I just wrote a post to the tractor developer
>>> forum asking about the one piece of this
>>> that I’m not 100% sure about, but if we go
>>> with Josiah’s recommendation of one slot per
>>> blade then my post will be irrelevant and
>>> YES we can do exactly what you want.
>>>
>>> And Josiah, I do think that you’re right
>>> about one slot per node, not only because of
>>> the great multithreading that everyone’s
>>> doing, but because our fileserver can’t keep
>>> up with too many concurrent reads/writes anyway.
>>>
>>> - chris
>>>
>>> On Apr 20, 2015, at 3:55 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>>
>>> >
>>> >
>>> > On 4/20/15 3:43 PM, Lee Spector wrote:
>>> >> Maybe it's finally time for me to do this. As
>>> >> Tom points out, I should have a better handle
>>> >> on it before he leaves. But the runs that I do
>>> >> are a little different from what he usually does.
>>> >>
>>> >> Is it now possible to do this with all of
>>> these properties?:
>>> >>
>>> >> - My run starts immediately.
>>> > Yes. If you spool with a high enough priority,
>>> > it will actually terminate and re-spool other
>>> > jobs that are in your way.
>>> >> - It runs on a node that I specify.
>>> > Yes. The service key would just be the node
>>> > you want.
>>> >> - No other jobs run on that node until my run
>>> >> is done, even if mine temporarily goes to low
>>> >> CPU utilization.
>>> > We would have to make it so that your node had
>>> > one slot, and then you did something like this
>>> > if you had more than one command:
>>> > http://fly.hampshire.edu/docs/Tractor/scripting.html#sharing-one-server-check-out-among-several-commands
>>> > I think that most things that we are doing these
>>> > days are multithreaded enough that one slot per
>>> > node might be a valid choice cluster-wide... at
>>> > least at the moment with Maxwell and prman. What
>>> > say others?
>>> >> - I can see stderr as well as stdout, which
>>> >> I'll be piping to a file (which has to be
>>> >> somewhere I can grep it).
>>> > Yes, just pipe each one to whatever file you want.
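For example, assuming the run is launched from a shell wrapper; `myrun` here is a stand-in for the real launch command, not an existing tool:

```shell
# Stand-in for the real job: one line on stdout, one on stderr
myrun() { echo "result"; echo "warning: oops" >&2; }

# Send stdout and stderr to a single greppable log file
myrun > run42.log 2>&1

# Or keep the streams separate so errors are easy to grep on their own
myrun > run42.out 2> run42.err
```

Note that `2>&1` must come after the `> file` redirection, since redirections are applied left to right.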
>>> >> - I can easily and instantly kill and restart
>>> >> my runs.
>>> > Yes. Right-click and restart, interrupt, or delete.
>>> >>
>>> >> If so, then yes, I guess I should try to switch
>>> >> to a tractor-based workflow for my work on fly.
>>> >>
>>> >> -Lee
>>> >>
>>> >>
>>> >>> On Apr 20, 2015, at 3:26 PM, Chris Perry <perry at hampshire.edu> wrote:
>>> >>>
>>> >>>
>>> >>> I have a different ideal solution to propose:
>>> >>> use tractor!
>>> >>>
>>> >>> What you are looking to do with your NIMBY
>>> >>> mechanism is to pull machines out of tractor
>>> >>> temporarily so they can run your job(s). But
>>> >>> this is exactly what tractor does: it gives a
>>> >>> job to an idle machine, then keeps that
>>> >>> machine from running new jobs until the first
>>> >>> job is killed or completed. If you spool with
>>> >>> high enough priority, tractor kills and
>>> >>> respools a lower-priority job to free up the
>>> >>> machine, so you don’t have to wait if
>>> >>> immediate access is a concern of yours.
>>> >>>
>>> >>> Not to mention, the tractor interface will
>>> >>> show running jobs, so there would be a “Lee’s
>>> >>> job” item in green on the list. That might
>>> >>> make you less likely to lose track in the way
>>> >>> you describe happening now.
>>> >>>
>>> >>> Worth considering? I’m sure there are a few
>>> >>> kinks to figure out (such as how to tag your
>>> >>> jobs so that they have the full machine,
>>> >>> guaranteed) but I feel confident that we can
>>> >>> do this.
>>> >>>
>>> >>> - chris
>>> >>>
>>> >>> On Apr 20, 2015, at 2:31 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>> >>>
>>> >>>> Some of these were probably my doing, but I
>>> >>>> only recall nimbying 1-4 and 4-5 in the
>>> >>>> recent past.
>>> >>>>
>>> >>>> It's not a problem with a node that causes me
>>> >>>> to do this; it's an interest in having total
>>> >>>> control of a node for a compute-intensive
>>> >>>> multithreaded run, with no chance that any
>>> >>>> other processes will be allocated on it.
>>> >>>> Sometimes I'll start one of these, check in
>>> >>>> on it regularly for a while, and then check
>>> >>>> in less frequently if it's not doing anything
>>> >>>> really interesting but I'm not ready to kill
>>> >>>> it in case it might still do something good.
>>> >>>> Then sometimes I lose track. Right now I have
>>> >>>> nothing running, and any nimbying that I've
>>> >>>> done can be unnimbyed, although I'm not 100%
>>> >>>> sure what I may have left nimbyed.
>>> >>>>
>>> >>>> Ideally, I guess, we'd have a nimby command
>>> >>>> that records who nimbyed and maybe
>>> >>>> periodically asks them if they still want it
>>> >>>> nimbyed. I'm not sure how difficult that
>>> >>>> would be. If somebody is inspired to do this,
>>> >>>> then it would also be nice if the command for
>>> >>>> nimbying/unnimbying was more straightforward
>>> >>>> than the current one (which takes a 1 or a 0
>>> >>>> as an argument, which is a little confusing),
>>> >>>> if it returned a value that made the new
>>> >>>> status clear, and if there were a simple way
>>> >>>> just to check the status (which maybe there
>>> >>>> is now? If so, I don't know it).
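A wrapper along the lines Lee describes could start as small as this sketch. Everything here is hypothetical: `nimby` is assumed to be the existing command that takes a 1 or a 0, and the `nimbyctl` name, log path, and messages are invented; a status subcommand is omitted because it's unclear whether the current command can be queried.

```shell
# Hypothetical wrapper: "nimby" is assumed to be the existing command
# (takes 1 or 0); nimbyctl and NIMBY_LOG are invented for illustration.
NIMBY_LOG="${NIMBY_LOG:-/var/log/nimby.log}"

nimbyctl() {
    case "$1" in
        on)  nimby 1 \
                 && echo "$(date): NIMBYed by $USER" >> "$NIMBY_LOG" \
                 && echo "node is now NIMBYed" ;;
        off) nimby 0 \
                 && echo "$(date): un-NIMBYed by $USER" >> "$NIMBY_LOG" \
                 && echo "node is now available" ;;
        *)   echo "usage: nimbyctl on|off" >&2; return 1 ;;
    esac
}
```

This addresses the two complaints directly: the on/off arguments replace the bare 1/0, the echoed message states the new status, and the log records who changed it and when.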
>>> >>>>
>>> >>>> -Lee
>>> >>>>
>>> >>>>
>>> >>>>> On Apr 20, 2015, at 1:36 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>> >>>>>
>>> >>>>> Hi all,
>>> >>>>> Why are nodes 1-17, 1-18, 1-2, 1-4, and
>>> >>>>> 1-9 NIMBYed? If you are having a problem
>>> >>>>> with a node that causes you to need to
>>> >>>>> NIMBY it, please let me know, because maybe
>>> >>>>> it just means I screwed up the service keys
>>> >>>>> or something. I'm not omniscient :)
>>> >>>>>
>>> >>>>> --
>>> >>>>> Wm. Josiah Erikson
>>> >>>>> Assistant Director of IT, Infrastructure Group
>>> >>>>> System Administrator, School of CS
>>> >>>>> Hampshire College
>>> >>>>> Amherst, MA 01002
>>> >>>>> (413) 559-6091
>>> >>>>>
>>> >>>>>
>>> >>>>> _______________________________________________
>>> >>>>> Clusterusers mailing list
>>> >>>>> Clusterusers at lists.hampshire.edu
>>> >>>>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>>> >>>> --
>>> >>>> Lee Spector, Professor of Computer Science
>>> >>>> Director, Institute for Computational Intelligence
>>> >>>> Cognitive Science, Hampshire College
>>> >>>> 893 West Street, Amherst, MA 01002-3359
>>> >>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>> >>>> Phone: 413-559-5352, Fax: 413-559-5438
>>> >>>>
>>> >> --
>>> >> Lee Spector, Professor of Computer Science
>>> >> Director, Institute for Computational Intelligence
>>> >> Cognitive Science, Hampshire College
>>> >> 893 West Street, Amherst, MA 01002-3359
>>> >> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>> >> Phone: 413-559-5352, Fax: 413-559-5438
>>> >>
>>> >
>>> > --
>>> > Wm. Josiah Erikson
>>> > Assistant Director of IT, Infrastructure Group
>>> > System Administrator, School of CS
>>> > Hampshire College
>>> > Amherst, MA 01002
>>> > (413) 559-6091
>>> >
>>>
>>>
>>>
>>>
>>> --
>>> Lee Spector, Professor of Computer Science
>>> Director, Institute for Computational Intelligence
>>> Cognitive Science, Hampshire College
>>> 893 West Street, Amherst, MA 01002-3359
>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>
>>>
>>>
>>>
>>> --
>>> Lee Spector, Professor of Computer Science
>>> Director, Institute for Computational Intelligence
>>> Cognitive Science, Hampshire College
>>> 893 West Street, Amherst, MA 01002-3359
>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>
>>>
>>>
>>
>> --
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>
--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091