[Clusterusers] Switching to new keg
Wm. Josiah Erikson
wjerikson at hampshire.edu
Mon Jun 22 16:38:35 EDT 2015
I've got a couple of nodes (compute-1-4 and compute-2-17) that won't
remount /helga. It looks, Tom, like your processes are still running and
won't let go of /helga because they were in fact writing their logs to it,
which I did not anticipate. If I can reboot those nodes, we should be
all set.
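For the record, if rebooting weren't convenient, something along these
lines would probably release the mount instead. This is only a sketch I
haven't run on these nodes; it assumes the tractor-blade process holding
the logs (PID 3717 on compute-1-4 below) can safely be stopped:

# see what's still holding files open on the stale mount
lsof | grep /helga
fuser -vm /helga

# stop the holder, then lazy-unmount and remount from fstab
kill 3717            # the tractor-blade python-bin from the lsof output; kill -9 if it hangs
umount -l /helga     # lazy unmount detaches the mount point even while it is "busy"
mount /helga         # /helga has an fstab entry, so a bare mount brings it back

Anyway, here's what the two nodes look like right now: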
[root at fly install]# ssh compute-1-4
Rocks Compute Node
Rocks 6.1.1 (Sand Boa)
Profile built 09:02 08-Jun-2015
Kickstarted 09:09 08-Jun-2015
[root at compute-1-4 ~]# ls /helga
ls: cannot access /helga: Stale file handle
[root at compute-1-4 ~]# umount -f /helga
umount2: Device or resource busy
umount.nfs: /helga: device is busy
umount2: Device or resource busy
umount.nfs: /helga: device is busy
[root at compute-1-4 ~]# ps aux | grep max
root 723 0.0 0.0 105312 864 pts/0 S+ 16:31 0:00 grep max
root 3717 0.1 0.0 349216 13068 ? Sl Jun08 38:24
/opt/pixar/tractor-blade-1.7.2/python/bin/python-bin
/opt/pixar/tractor-blade-1.7.2/blade-modules/tractor-blade.py
--dyld_framework_path=None
--ld_library_path=/opt/pixar/tractor-blade-1.7.2/python/lib:/opt/gridengine/lib/linux-x64:/opt/openmpi/lib:/opt/python/lib:/share/apps/maxwell-3.1:/share/apps/maxwell64-3.1
--debug --log /var/log/tractor-blade.log -P 8000
[root at compute-1-4 ~]# ps aux | grep Max
root 725 0.0 0.0 105312 864 pts/0 S+ 16:32 0:00 grep Max
[root at compute-1-4 ~]# lsof | grep /helga
lsof: WARNING: can't stat() nfs4 file system /helga
Output information may be incomplete.
python-bi 3717 root 5w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T2.log (stat: Stale file handle)
python-bi 3717 root 8w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T9.log (stat: Stale file handle)
python-bi 3717 root 12w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T14.log (stat: Stale file handle)
python-bi 3717 root 16w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T24.log (stat: Stale file handle)
[root at compute-1-4 ~]# top
top - 16:32:23 up 14 days, 7:21, 1 user, load average: 33.77, 35.33, 35.73
Tasks: 230 total, 1 running, 229 sleeping, 0 stopped, 0 zombie
Cpu(s): 50.2%us, 0.2%sy, 0.0%ni, 49.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 15928760k total, 8015840k used, 7912920k free, 238044k buffers
Swap: 2047992k total, 0k used, 2047992k free, 1094888k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8428 thelmuth 20 0 6089m 1.5g 11m S 254.9 10.1 2809:37 java
8499 thelmuth 20 0 6089m 1.5g 11m S 243.6 10.1 2829:15 java
8367 thelmuth 20 0 6089m 1.5g 11m S 236.0 10.0 2829:42 java
8397 thelmuth 20 0 6089m 1.5g 11m S 28.3 10.1 2787:46 java
737 root 20 0 15036 1220 840 R 1.9 0.0 0:00.01 top
1536 root 20 0 0 0 0 S 1.9 0.0 10:24.07 kondemand/6
1 root 20 0 19344 1552 1232 S 0.0 0.0 0:00.46 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.53 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:02.85 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:01.50 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 0:00.49 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 0:01.85 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:01.21 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:00.50 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 0:02.12 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:01.06 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:01.81 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 0:00.50 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/3
19 root RT 0 0 0 0 S 0.0 0.0 0:01.76 migration/4
20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
21 root 20 0 0 0 0 S 0.0 0.0 0:04.57 ksoftirqd/4
22 root RT 0 0 0 0 S 0.0 0.0 0:01.03 watchdog/4
23 root RT 0 0 0 0 S 0.0 0.0 0:00.51 migration/5
24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
25 root 20 0 0 0 0 S 0.0 0.0 0:02.27 ksoftirqd/5
[root at compute-1-4 ~]# umount -f /helga
umount2: Device or resource busy
umount.nfs: /helga: device is busy
umount2: Device or resource busy
umount.nfs: /helga: device is busy
[root at compute-1-4 ~]# umount -f /helga
umount2: Device or resource busy
umount.nfs: /helga: device is busy
umount2: Device or resource busy
umount.nfs: /helga: device is busy
[root at compute-1-4 ~]# mount -o remount /helga
mount.nfs: Stale file handle
[root at compute-1-4 ~]# mount -o remount /helga
mount.nfs: Stale file handle
[root at compute-1-4 ~]# top
top - 16:33:19 up 14 days, 7:22, 1 user, load average: 36.17, 35.76, 35.86
Tasks: 229 total, 1 running, 228 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.6%us, 0.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 15928760k total, 8016408k used, 7912352k free, 238044k buffers
Swap: 2047992k total, 0k used, 2047992k free, 1094888k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8499 thelmuth 20 0 6089m 1.5g 11m S 209.2 10.1 2831:08 java
8397 thelmuth 20 0 6089m 1.5g 11m S 198.9 10.1 2789:37 java
8367 thelmuth 20 0 6089m 1.5g 11m S 196.2 10.0 2831:37 java
8428 thelmuth 20 0 6089m 1.5g 11m S 192.9 10.1 2811:28 java
764 root 20 0 15164 1352 952 R 0.7 0.0 0:00.03 top
1532 root 20 0 0 0 0 S 0.3 0.0 20:31.35 kondemand/2
1534 root 20 0 0 0 0 S 0.3 0.0 12:32.23 kondemand/4
1536 root 20 0 0 0 0 S 0.3 0.0 10:24.14 kondemand/6
1537 root 20 0 0 0 0 S 0.3 0.0 17:00.03 kondemand/7
3717 root 20 0 341m 12m 3000 S 0.3 0.1 38:24.92 python-bin
1 root 20 0 19344 1552 1232 S 0.0 0.0 0:00.46 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.53 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:02.85 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:01.50 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 0:00.49 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 0:01.85 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:01.21 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:00.50 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 0:02.12 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:01.06 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:01.81 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 0:00.50 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/3
19 root RT 0 0 0 0 S 0.0 0.0 0:01.76 migration/4
20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
21 root 20 0 0 0 0 S 0.0 0.0 0:04.57 ksoftirqd/4
[root at compute-1-4 ~]# umount -f /helga
umount2: Device or resource busy
umount.nfs: /helga: device is busy
umount2: Device or resource busy
umount.nfs: /helga: device is busy
[root at compute-1-4 ~]# mount /helga
mount.nfs: Stale file handle
[root at compute-1-4 ~]# logout
Connection to compute-1-4 closed.
[root at fly install]# ssh compute-2-17
Rocks Compute Node
Rocks 6.1.1 (Sand Boa)
Profile built 23:42 25-Dec-2014
Kickstarted 23:58 25-Dec-2014
[root at compute-2-17 ~]# ls /helga
ls: cannot access /helga: Stale file handle
[root at compute-2-17 ~]# umount -f /helga
umount2: Device or resource busy
umount.nfs: /helga: device is busy
umount2: Device or resource busy
umount.nfs: /helga: device is busy
[root at compute-2-17 ~]# lsof | grep /helga
lsof: WARNING: can't stat() nfs4 file system /helga
Output information may be incomplete.
python-bi 28615 root 8w unknown /helga/public_html/tractor/cmd-logs/hz12/J1504210040/T3.log (stat: Stale file handle)
[root at compute-2-17 ~]#
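If any other nodes are in the same state, a quick sweep like the
following should find them. This is just a hypothetical one-liner; it
assumes passwordless root ssh to the compute nodes (which is how I got
in above), and I think 'rocks run host' would do much the same across
the whole cluster:

# check each node for a stale or missing /helga mount
for node in compute-1-4 compute-2-17; do    # extend to the full compute node list
    ssh "$node" 'ls /helga >/dev/null 2>&1 && echo "$(hostname -s): ok" || echo "$(hostname -s): stale or unmounted"'
done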
On 6/22/15 4:33 PM, Wm. Josiah Erikson wrote:
> Uh... shouldn't have been. I wonder if it was because I yanked its
> logfile place out from under it? I did not expect that to happen! It's
> back now, but it looks like all of your runs errored out. I'm so sorry!
> -Josiah
>
>
> On 6/22/15 4:00 PM, Thomas Helmuth wrote:
>> Very cool!
>>
>> It looks like tractor is down. Is that related to this move?
>>
>> Tom
>>
>> On Mon, Jun 22, 2015 at 1:50 PM, Wm. Josiah Erikson
>> <wjerikson at hampshire.edu> wrote:
>>
>> Keg will be going down shortly, and coming back up as a harder, faster,
>> better, stronger version shortly as well, I hope :)
>> -Josiah
>>
>>
>> On 6/11/15 9:53 AM, Wm. Josiah Erikson wrote:
>> > Hi all,
>> > I'm pretty sure there is 100% overlap between people who care
>> > about fly and people who care about keg at this point (though there
>> > are some people who care about fly and not so much keg, like Lee and
>> > Tom - sorry), so I'm sending this to this list.
>> > I have a new 32TB 14+2 RAID6 with 24GB of RAM superfast (way
>> > faster than gigabit) keg all ready to go! I would like to bring it up
>> > on Monday, June 22nd. It would be ideal if rendering was NOT happening
>> > at that time, to make my rsyncing life easier :) Any objections?
>> > -Josiah
>> >
>>
>> --
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>>
>>
>>
>>
>> _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
--
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091