[Clusterusers] NIMBYed nodes

Thomas Helmuth thelmuth at cs.umass.edu
Mon Apr 20 16:59:09 EDT 2015


I agree with the others that it should be pretty easy to do everything you
need. And I can walk you through how to do it easily in Clojush with the
Python script I use to launch runs.
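
Just to give the flavor, the core of it is simple: start one "lein run"
process per Clojush run and send stdout+stderr to a log file. Here's a
stripped-down, hypothetical sketch, not my actual script -- the paths and
the problem namespace are placeholders:

    # Hypothetical sketch, not the real launcher: the paths and the problem
    # namespace below are placeholders.
    import subprocess
    from pathlib import Path

    CLOJUSH_DIR = "/home/thelmuth/Clojush"   # placeholder path
    LOG_DIR = Path("/home/thelmuth/runs")    # placeholder path

    def launch_run(problem_ns, run_id):
        """Start one Clojush run via `lein run`, logging stdout+stderr to a file."""
        LOG_DIR.mkdir(parents=True, exist_ok=True)
        log = open(LOG_DIR / ("%s-run%d.log" % (problem_ns, run_id)), "w")
        # One OS process per run; stderr is merged into the same log so the
        # whole thing can be grepped later.
        return subprocess.Popen(["lein", "run", problem_ns],
                                cwd=CLOJUSH_DIR,
                                stdout=log, stderr=subprocess.STDOUT)

    # e.g., four independent runs of the same problem on one node
    procs = [launch_run("clojush.problems.demos.simple-regression", i)
             for i in range(4)]
    for p in procs:
        p.wait()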

I'm not sure whether I'd get more or less throughput with one slot per
node, but I'm guessing less -- I often have multiple runs going on a node,
so a single multithreaded run would need a 4x or 8x speedup to get the same
amount of work done. But maybe that would just happen automatically with
how we have things set up in Clojush.

Tom

On Mon, Apr 20, 2015 at 4:19 PM, Chris Perry <perry at hampshire.edu> wrote:

>
> I just wrote a post to the tractor developer forum asking about the one
> piece of this that I'm not 100% sure about, but if we go with Josiah's
> recommendation of one slot per blade then my post will be irrelevant and
> YES we can do exactly what you want.
>
> And Josiah, I do think that you're right about one slot per node, not only
> because of the great multithreading that everyone's doing, but because our
> fileserver can't keep up with too many concurrent reads/writes anyway.
>
> - chris
>
> On Apr 20, 2015, at 3:55 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu>
> wrote:
>
> >
> >
> > On 4/20/15 3:43 PM, Lee Spector wrote:
> >> Maybe it's finally time for me to do this. As Tom points out, I should
> have a better handle on it before he leaves. But the runs that I do are a
> little different from the ones he usually does.
> >>
> >> Is it now possible to do this with all of these properties?:
> >>
> >> - My run starts immediately.
> > Yes. If you spool with a high enough priority, it will actually
> > terminate and re-spool other jobs that are in your way.
> >> - It runs on a node that I specify.
> > Yes. The service key would just be the node you want.
> >> - No other jobs run on that node until my run is done, even if mine
> temporarily goes to low CPU utilization.
> > We would have to make it so that your node had one slot, and then, if you
> > had more than one command, you did something like this (see the sketch
> > below):
> http://fly.hampshire.edu/docs/Tractor/scripting.html#sharing-one-server-check-out-among-several-commands
> > I think that most things that we are doing these days are multithreaded
> > enough that one slot per node might be a valid choice cluster-wide... at
> > least at the moment with Maxwell and prman. What say others?
> >> - I can see stderr as well as stdout, which I'll be piping to a file
> (which has to be somewhere I can grep it).
> > Yes -- just pipe each one to whatever file you want.
> >> - I can easily and instantly kill and restart my runs.
> > Yes. Right-click and restart, interrupt or delete.
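> >
> > To tie those answers together, a job for one of these runs could look
> > roughly like the sketch below. This is untested and from memory: the
> > service key, paths, priority number, and problem name are placeholders,
> > so check the options against the scripting docs before trusting it. And
> > if a job ends up with more than one command, the shared-checkout
> > mechanism from the link above is the way to keep them on one slot; crib
> > the exact syntax from that page.
> >
> >     #!/usr/bin/env python3
> >     # Untested sketch: the service key, paths, priority, and engine
> >     # defaults are placeholders/assumptions.
> >     import subprocess, tempfile
> >
> >     ALF = """Job -title {lee-pushgp-run} -priority 200 -subtasks {
> >         Task {run} -cmds {
> >             RemoteCmd {/bin/sh -c "cd /home/lspector/Clojush && lein run my.problem > /home/lspector/logs/run1.txt 2>&1"} -service {node1-17}
> >         }
> >     }
> >     """
> >
> >     # Write the job script to a temporary .alf file, then spool it.
> >     with tempfile.NamedTemporaryFile("w", suffix=".alf", delete=False) as f:
> >         f.write(ALF)
> >         jobfile = f.name
> >
> >     # tractor-spool hands the job to the engine; the high priority lets it
> >     # preempt lower-priority work, and the service key pins it to one node,
> >     # which (with one slot per node) keeps everything else off that node.
> >     subprocess.check_call(["tractor-spool", jobfile])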
> >>
> >> If so then yes, I guess I should try to switch to a tractor-based
> workflow for my work on fly.
> >>
> >> -Lee
> >>
> >>
> >>> On Apr 20, 2015, at 3:26 PM, Chris Perry <perry at hampshire.edu> wrote:
> >>>
> >>>
> >>> I have a different ideal solution to propose: use tractor!
> >>>
> >>> What you are looking to do with your NIMBY mechanism is to pull
> machines out of tractor temporarily so they can run your job(s). But this
> is exactly what tractor does: it gives a job to an idle machine and then
> keeps that machine from running new jobs until the first job is killed or
> completed. If you spool with a high enough priority, tractor kills and
> respools a lower-priority job to free up the machine, so you don't have to
> wait if immediate access is a concern.
> >>>
> >>> Not to mention, the tractor interface shows running jobs, so there
> would be a "Lee's job" item in green on the list. That might make you less
> likely to lose track of runs the way you describe happening now.
> >>>
> >>> Worth considering? I'm sure there are a few kinks to figure out (such
> as how to tag your jobs so that they have the full machine, guaranteed) but
> I feel confident that we can do this.
> >>>
> >>> - chris
> >>>
> >>> On Apr 20, 2015, at 2:31 PM, Lee Spector <lspector at hampshire.edu>
> wrote:
> >>>
> >>>> Some of these were probably my doing, but I only recall nimbying 1-4
> and 4-5 in the recent past.
> >>>>
> >>>> It's not a problem with a node that causes me to do this; it's an
> interest in having total control of a node for a compute-intensive
> multithreaded run, with no chance that any other processes will be
> allocated on it. Sometimes I'll start one of these, check in on it
> regularly for a while, and then check in less frequently if it's not doing
> anything really interesting but I'm not ready to kill it in case it might
> still do something good. Then sometimes I lose track. Right now I have
> nothing running, and any nimbying that I've done can be unnimbyed, although
> I'm not 100% sure what I may have left nimbyed.
> >>>>
> >>>> Ideally, I guess, we'd have a nimby command that records who nimbyed
> and maybe periodically asks them if they still want it nimbyed. I'm not
> sure how difficult that would be. If somebody is inspired to do this, then
> it would also be nice if the command for nimbying/unnimbying were more
> straightforward than the current one (which takes a 1 or a 0 as an
> argument, which is a little confusing), if it returned a value that made
> the new status clearer, and if there were a simple way just to check the
> status (which maybe there is now? if so I don't know it).
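> >>>>
> >>>> To make this concrete, I'm imagining something roughly like the sketch
> >>>> below (the name of the underlying 1/0 command and the record-file path
> >>>> are made up, since I don't remember what the real ones are); a cron job
> >>>> could then read the record file and nag people about stale NIMBYs:
> >>>>
> >>>>     #!/usr/bin/env python3
> >>>>     # Sketch only: "set_nimby_raw" stands in for whatever the real
> >>>>     # 1/0-taking command is called; the record-file path is invented.
> >>>>     import getpass, json, subprocess, sys, time
> >>>>
> >>>>     RECORD = "/usr/local/etc/nimby_records.json"   # invented location
> >>>>
> >>>>     def load_records():
> >>>>         try:
> >>>>             with open(RECORD) as f:
> >>>>                 return json.load(f)
> >>>>         except (OSError, ValueError):
> >>>>             return {}
> >>>>
> >>>>     def main():
> >>>>         action = sys.argv[1] if len(sys.argv) > 1 else "status"  # on | off | status
> >>>>         node = subprocess.check_output(["hostname", "-s"]).decode().strip()
> >>>>         records = load_records()
> >>>>         if action in ("on", "off"):
> >>>>             # Call the existing command (placeholder name) with 1 or 0,
> >>>>             # and record who did it and when.
> >>>>             subprocess.check_call(["set_nimby_raw", "1" if action == "on" else "0"])
> >>>>             if action == "on":
> >>>>                 records[node] = {"who": getpass.getuser(), "when": time.ctime()}
> >>>>             else:
> >>>>                 records.pop(node, None)
> >>>>             with open(RECORD, "w") as f:
> >>>>                 json.dump(records, f, indent=2)
> >>>>             print("%s is now %sNIMBYed" % (node, "" if action == "on" else "un-"))
> >>>>         elif node in records:
> >>>>             print("%s: NIMBYed by %s since %s"
> >>>>                   % (node, records[node]["who"], records[node]["when"]))
> >>>>         else:
> >>>>             print("%s: not NIMBYed" % node)
> >>>>
> >>>>     if __name__ == "__main__":
> >>>>         main()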
> >>>>
> >>>> -Lee
> >>>>
> >>>>
> >>>>> On Apr 20, 2015, at 1:36 PM, Wm. Josiah Erikson <
> wjerikson at hampshire.edu> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>> Why are nodes 1-17, 1-18, 1-2, 1-4, and 1-9 NIMBYed? If you are
> >>>>> having a problem with a node that causes you to need to NIMBY it,
> please
> >>>>> let me know, because maybe it just means I screwed up the service
> keys
> >>>>> or something. I'm not omniscient :)
> >>>>>
> >>>>> --
> >>>>> Wm. Josiah Erikson
> >>>>> Assistant Director of IT, Infrastructure Group
> >>>>> System Administrator, School of CS
> >>>>> Hampshire College
> >>>>> Amherst, MA 01002
> >>>>> (413) 559-6091
> >>>>>
> >>>> --
> >>>> Lee Spector, Professor of Computer Science
> >>>> Director, Institute for Computational Intelligence
> >>>> Cognitive Science, Hampshire College
> >>>> 893 West Street, Amherst, MA 01002-3359
> >>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
> >>>> Phone: 413-559-5352, Fax: 413-559-5438
> >>>>
> >> --
> >> Lee Spector, Professor of Computer Science
> >> Director, Institute for Computational Intelligence
> >> Cognitive Science, Hampshire College
> >> 893 West Street, Amherst, MA 01002-3359
> >> lspector at hampshire.edu, http://hampshire.edu/lspector/
> >> Phone: 413-559-5352, Fax: 413-559-5438
> >>
> >
> > --
> > Wm. Josiah Erikson
> > Assistant Director of IT, Infrastructure Group
> > System Administrator, School of CS
> > Hampshire College
> > Amherst, MA 01002
> > (413) 559-6091
> >
>
>