[Clusterusers] breve segmentation faults on fly

Chris Perry perry at hampshire.edu
Tue Oct 30 16:38:52 EDT 2007


Josiah has it all ready once the current wave of render jobs  
finishes. I think he may try to do this from home tonight. We'll keep  
you posted. Thanks for responding so quickly!

- chris

On Oct 30, 2007, at 1:43 PM, Lee Spector wrote:

>
> No problem -- I've just killed my run. Let me know when I can start  
> a new one.
>
>  -Lee
>
> On Oct 30, 2007, at 11:36 AM, Chris Perry wrote:
>
>>
>> Lee -
>>
>> We need to upgrade some RenderMan software which will require
>> rebooting/re-imaging the cluster nodes. It's something we would like
>> to do soon, but I see that you've got a run going at the moment. Do
>> you see a point in the near future when we might be able to take
>> things down for a bit?
>>
>> Thanks!
>>
>> - chris
>>
>> Quoting Lee Spector <lspector at hampshire.edu>:
>>
>>>
>>> Ah -- right. Yes, that gives me a pretty good idea, and then looking
>>> directly at the individual nodes in ganglia I can actually see
>>> exactly where the run starts and where the CPU stops. And.... nothing
>>> to see there, really. I've got one segfault in my current run which
>>> started around 14:00 yesterday, and I can see all of the nodes
>>> ramping up but then compute-1-9, without any other apparent surges
>>> of activity, drops off around 20:00.
>>>
>>>   -Lee
>>>
>>> On Oct 29, 2007, at 9:42 PM, Chris Perry wrote:
>>>
>>>>
>>>> Maybe your log/output files for those runs stop when they segfault?
>>>> That might be a timestamp you can use.
>>>>
>>>> - chris
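
A rough sketch of the log-timestamp idea above, assuming each run writes
its output to a file that stops growing when the process segfaults. In
that case the file's last-modification time approximates the crash time.
The function names, the "*.log" pattern, and the directory layout are
hypothetical; nothing here is specific to breve or to this cluster.

```python
import time
from pathlib import Path


def last_write_times(log_dir, pattern="*.log"):
    """Return (path, mtime) pairs for matching files, oldest write first."""
    entries = [
        (path, path.stat().st_mtime)
        for path in Path(log_dir).glob(pattern)
        if path.is_file()
    ]
    return sorted(entries, key=lambda entry: entry[1])


def format_report(entries):
    """Render one 'YYYY-MM-DD HH:MM:SS  filename' line per log file."""
    lines = []
    for path, mtime in entries:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        lines.append(f"{stamp}  {path.name}")
    return lines
```

Printing format_report(last_write_times("some/run/dir")) gives times that
can be lined up against the per-node ganglia graphs. A log that went
quiet hours before the others is a candidate for the crashed process.
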
>>>>
>>>> On Oct 29, 2007, at 9:19 PM, Lee Spector wrote:
>>>>
>>>>>
>>>>> Interesting idea but not trivial to tell, since the seg fault
>>>>> messages aren't time stamped and I'm not sure exactly when they
>>>>> happened.... but this might be worth looking into more carefully...
>>>>>
>>>>>  -Lee
>>>>>
>>>>> On Oct 29, 2007, at 9:06 PM, Chris Perry wrote:
>>>>>
>>>>>>
>>>>>> I know we've been rendering fairly hard recently - do the graphs
>>>>>> on ganglia show a correlation between your seg fault times/
>>>>>> machines and a lack of RAM on those machines? I would expect a
>>>>>> different error than seg fault in that case, but it's worth
>>>>>> seeing if there's a possible connection.
>>>>>>
>>>>>> - chris
>>>>>>
>>>>>> On Oct 29, 2007, at 2:09 PM, Lee Spector wrote:
>>>>>>
>>>>>>>
>>>>>>> Great -- I didn't see any core files, but where would I find
>>>>>>> them from a run like this?
>>>>>>>
>>>>>>>  -Lee
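
On where core files would end up: with default Linux settings a core
file is written to the crashed process's working directory, named
"core" or "core.<pid>", and only if the core size limit is nonzero. The
sketch below checks both things. It assumes that default naming (the
kernel's core_pattern setting can change it), and the function names are
made up for illustration.

```python
import resource
from pathlib import Path


def core_dumps_enabled():
    """True if this process's soft core-file size limit is nonzero."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return soft != 0


def looks_like_core(name):
    """Match the default names 'core' and 'core.<pid>' only."""
    if name == "core":
        return True
    prefix, _, suffix = name.partition(".")
    return prefix == "core" and suffix.isdigit()


def find_core_files(run_dir):
    """Return paths under run_dir (recursively) that look like core files."""
    return sorted(
        path
        for path in Path(run_dir).rglob("core*")
        if path.is_file() and looks_like_core(path.name)
    )
```

If core_dumps_enabled() returns False in the environment that launches
the runs, no core file would have been written at all, which would
explain not finding any. The limit is per process, so it has to be
checked on the compute nodes under the same launcher as the real job.
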
>>>>>>>
>>>>>>> On Oct 29, 2007, at 1:55 PM, jon klein wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I can't really say why that build of breve is crashing for you,
>>>>>>>> but it is pretty old at this point and due to be updated soon.
>>>>>>>> The latest builds do have some potential memory savings, so it
>>>>>>>> could help out.  I'll make a new build soon and let you know
>>>>>>>> when it's installed.
>>>>>>>>
>>>>>>>> Did the crashes leave any core files behind?  That would be the
>>>>>>>> only way to get a hint of why they might have crashed.
>>>>>>>>
>>>>>>>> -- jon klein
>>>>>>>>
>>>>>>>>
>>>>>>>> On Oct 29, 2007, at 12:41 PM, Lee Spector wrote:
>>>>>>>>
>>>>>>>>> On Oct 29, 2007, at 11:24 AM, Wm. Josiah Erikson wrote:
>>>>>>>>>> Hum. Well, uh, unless you tell me something that seems like
>>>>>>>>>> evidence to the contrary, I'll assume the ball is in some
>>>>>>>>>> court other than mine for the time being?
>>>>>>>>>
>>>>>>>>> Josiah: I guess so. The OS memory limit seemed like such a
>>>>>>>>> good theory that I'm having a hard time letting go of it :-),
>>>>>>>>> but if the system is back to where it was previously then I
>>>>>>>>> don't see what could be causing this on your end.
>>>>>>>>>
>>>>>>>>> Jon: I'm using  /share/apps/breve/dev/bin/breve_cli, which
>>>>>>>>> looks to be a version that has been there since May:
>>>>>>>>>
>>>>>>>>> $ ls -l  /share/apps/breve/dev/bin/breve_cli
>>>>>>>>> -rwxr-xr-x  1 1000 1000 305 May 26 07:22 /share/apps/breve/dev/bin/breve_cli
>>>>>>>>>
>>>>>>>>> Might that have something to do with this, and should it be
>>>>>>>>> updated?
>>>>>>>>>
>>>>>>>>> I can kill my current run at any point if I should do that for
>>>>>>>>> upgrading breve.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>  -Lee
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Lee Spector, Professor of Computer Science
>>>>>>> School of Cognitive Science, Hampshire College
>>>>>>> 893 West Street, Amherst, MA 01002-3359
>>>>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>>>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Clusterusers mailing list
>>>>>>> Clusterusers at lists.hampshire.edu
>>>>>>> http://lists.hampshire.edu/mailman/listinfo/clusterusers



