[Clusterusers] breve segmentation faults on fly

Chris Perry perry at hampshire.edu
Tue Oct 30 11:36:46 EDT 2007


Lee -

We need to upgrade some RenderMan software which will require
rebooting/re-imaging the cluster nodes. It's something we would like to do
soon, but I see that you've got a run going at the moment. Do you see a point
in the near future when we might be able to take things down for a bit?

Thanks!

- chris

Quoting Lee Spector <lspector at hampshire.edu>:

>
> Ah -- right. Yes, that gives me a pretty good idea, and then looking
> directly at the individual nodes in ganglia I can actually see
> exactly where the run starts and where the CPU stops. And.... nothing
> to see there, really. I've got one segfault in my current run which
> started around 14:00 yesterday, and I can see all of the nodes
> ramping up but then compute-1-9, without any other apparent surges of
> activity, drops off around 20:00.
>
>   -Lee
>
> On Oct 29, 2007, at 9:42 PM, Chris Perry wrote:
>
> >
> > Maybe your log/output files for those runs stop when they segfault?
> > That might give you a timestamp you can use.
> >
> > - chris
> >
> > On Oct 29, 2007, at 9:19 PM, Lee Spector wrote:
> >
> >>
> >> Interesting idea but not trivial to tell, since the seg fault
> >> messages aren't time stamped and I'm not sure exactly when they
> >> happened.... but this might be worth looking into more carefully...
> >>
> >>  -Lee
> >>
> >> On Oct 29, 2007, at 9:06 PM, Chris Perry wrote:
> >>
> >>>
> >>> I know we've been rendering fairly hard recently - do the graphs
> >>> on ganglia show a correlation between your seg fault times/
> >>> machines and a lack of RAM on those machines? I would expect a
> >>> different error than seg fault in that case, but it's worth
> >>> seeing if there's a possible connection.
> >>>
> >>> - chris
> >>>
> >>> On Oct 29, 2007, at 2:09 PM, Lee Spector wrote:
> >>>
> >>>>
> >>>> Great -- I didn't see any core files, but where would I find
> >>>> them from a run like this?
> >>>>
> >>>>  -Lee
> >>>>
> >>>> On Oct 29, 2007, at 1:55 PM, jon klein wrote:
> >>>>
> >>>>>
> >>>>> I can't really say why that build of breve is crashing for you,
> >>>>> but it is pretty old at this point and due to be updated soon.
> >>>>> The latest builds do have some potential memory savings, so it
> >>>>> could help out.  I'll make a new build soon and let you know
> >>>>> when it's installed.
> >>>>>
> >>>>> Did the crashes leave any core files behind?  That would be the
> >>>>> only way to get a hint of why they might have crashed.
> >>>>>
> >>>>> -- jon klein
> >>>>>
> >>>>>
> >>>>> On Oct 29, 2007, at 12:41 PM, Lee Spector wrote:
> >>>>>
> >>>>>> On Oct 29, 2007, at 11:24 AM, Wm. Josiah Erikson wrote:
> >>>>>>> Hum. Well, uh, unless you tell me something that seems like
> >>>>>>> evidence to the contrary, I'll assume the ball is in some
> >>>>>>> court other than mine for the time being?
> >>>>>>
> >>>>>> Josiah: I guess so. The OS memory limit seemed like such a
> >>>>>> good theory that I'm having a hard time letting go of it :-),
> >>>>>> but if the system is back to where it was previously then I
> >>>>>> don't see what could be causing this on your end.
> >>>>>>
> >>>>>> Jon: I'm using  /share/apps/breve/dev/bin/breve_cli, which
> >>>>>> looks to be a version that has been there since May:
> >>>>>>
> >>>>>> $ ls -l  /share/apps/breve/dev/bin/breve_cli
> >>>>>> -rwxr-xr-x  1 1000 1000 305 May 26 07:22 /share/apps/breve/dev/bin/breve_cli
> >>>>>>
> >>>>>> Might that have something to do with this, and should it be
> >>>>>> updated?
> >>>>>>
> >>>>>> I can kill my current run at any point if I should do that for
> >>>>>> upgrading breve.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>  -Lee
> >>>>>>
> >>>>
> >>>> --
> >>>> Lee Spector, Professor of Computer Science
> >>>> School of Cognitive Science, Hampshire College
> >>>> 893 West Street, Amherst, MA 01002-3359
> >>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
> >>>> Phone: 413-559-5352, Fax: 413-559-5438
> >>>>
> >>>> _______________________________________________
> >>>> Clusterusers mailing list
> >>>> Clusterusers at lists.hampshire.edu
> >>>> http://lists.hampshire.edu/mailman/listinfo/clusterusers




