[Clusterusers] Switching to new keg

Thomas Helmuth thelmuth at cs.umass.edu
Mon Jun 22 16:51:59 EDT 2015


Go for it.

On Mon, Jun 22, 2015 at 4:38 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:

>  I've got a couple of nodes (compute-1-4 and compute-2-17) that won't
> remount /helga. It looks, Tom, like your processes are still running and
> won't let go of /helga because they were in fact writing logs to it, which
> I did not anticipate. If I can reboot those nodes, we should be all set:
>
> [root at fly install]# ssh compute-1-4
> Rocks Compute Node
> Rocks 6.1.1 (Sand Boa)
> Profile built 09:02 08-Jun-2015
>
> Kickstarted 09:09 08-Jun-2015
> [root at compute-1-4 ~]# ls /helga
> ls: cannot access /helga: Stale file handle
> [root at compute-1-4 ~]# umount -f /helga
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> [root at compute-1-4 ~]# ps aux | grep max
> root       723  0.0  0.0 105312   864 pts/0    S+   16:31   0:00 grep max
> root      3717  0.1  0.0 349216 13068 ?        Sl   Jun08  38:24 /opt/pixar/tractor-blade-1.7.2/python/bin/python-bin /opt/pixar/tractor-blade-1.7.2/blade-modules/tractor-blade.py --dyld_framework_path=None --ld_library_path=/opt/pixar/tractor-blade-1.7.2/python/lib:/opt/gridengine/lib/linux-x64:/opt/openmpi/lib:/opt/python/lib:/share/apps/maxwell-3.1:/share/apps/maxwell64-3.1 --debug --log /var/log/tractor-blade.log -P 8000
> [root at compute-1-4 ~]# ps aux | grep Max
> root       725  0.0  0.0 105312   864 pts/0    S+   16:32   0:00 grep Max
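[Editorial aside: in the `ps aux | grep max` check above, the only hit is the grep process matching its own command line. The usual workaround is `pgrep -f`, which matches full command lines and never reports itself. A minimal sketch; the `sleep 300` process is a stand-in spawned just for the demo, not something from the session above.]

```shell
#!/bin/sh
# Spawn a throwaway process to search for (a stand-in for tractor-blade).
sleep 300 &
demo_pid=$!

# pgrep -f matches against full command lines and excludes itself,
# so there is no spurious "grep ..." line to filter out.
pgrep -f "sleep 300"

# Clean up the throwaway process.
kill "$demo_pid"
```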
> [root at compute-1-4 ~]# lsof | grep /helga
> lsof: WARNING: can't stat() nfs4 file system /helga
>       Output information may be incomplete.
> python-bi  3717      root    5w  unknown  /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T2.log (stat: Stale file handle)
> python-bi  3717      root    8w  unknown  /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T9.log (stat: Stale file handle)
> python-bi  3717      root   12w  unknown  /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T14.log (stat: Stale file handle)
> python-bi  3717      root   16w  unknown  /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T24.log (stat: Stale file handle)
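[Editorial aside: when lsof warns it "can't stat() nfs4 file system", the same open-file information can be read straight from /proc, since each process's open descriptors appear as symlinks under /proc/<pid>/fd. A hedged sketch: `holders` is a name invented for this demo, and the /tmp path is a stand-in for /helga.]

```shell
#!/bin/sh
# holders: list file descriptors (of processes we can inspect) that point
# at files under a given directory, by reading /proc/<pid>/fd symlinks.
holders() {
    dir=$1
    for fd in /proc/[0-9]*/fd/*; do
        # readlink fails for other users' processes; skip those quietly.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$dir"/*) echo "$fd -> $target" ;;
        esac
    done
}

# Demo against a scratch directory standing in for /helga:
mkdir -p /tmp/helga-demo
exec 9>/tmp/helga-demo/T2.log   # hold a log file open, as tractor-blade did
holders /tmp/helga-demo         # lists this shell's fd 9
exec 9>&-                       # release it
```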
> [root at compute-1-4 ~]# top
>
> top - 16:32:23 up 14 days,  7:21,  1 user,  load average: 33.77, 35.33, 35.73
> Tasks: 230 total,   1 running, 229 sleeping,   0 stopped,   0 zombie
> Cpu(s): 50.2%us,  0.2%sy,  0.0%ni, 49.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  15928760k total,  8015840k used,  7912920k free,   238044k buffers
> Swap:  2047992k total,        0k used,  2047992k free,  1094888k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
>  8428 thelmuth  20   0 6089m 1.5g  11m S 254.9 10.1   2809:37 java
>  8499 thelmuth  20   0 6089m 1.5g  11m S 243.6 10.1   2829:15 java
>  8367 thelmuth  20   0 6089m 1.5g  11m S 236.0 10.0   2829:42 java
>  8397 thelmuth  20   0 6089m 1.5g  11m S  28.3 10.1   2787:46 java
>   737 root      20   0 15036 1220  840 R   1.9  0.0   0:00.01 top
>  1536 root      20   0     0    0    0 S   1.9  0.0  10:24.07 kondemand/6
>     1 root      20   0 19344 1552 1232 S   0.0  0.0   0:00.46 init
>     2 root      20   0     0    0    0 S   0.0  0.0   0:00.02 kthreadd
>     3 root      RT   0     0    0    0 S   0.0  0.0   0:00.53 migration/0
>     4 root      20   0     0    0    0 S   0.0  0.0   0:02.85 ksoftirqd/0
>     5 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/0
>     6 root      RT   0     0    0    0 S   0.0  0.0   0:01.50 watchdog/0
>     7 root      RT   0     0    0    0 S   0.0  0.0   0:00.49 migration/1
>     8 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/1
>     9 root      20   0     0    0    0 S   0.0  0.0   0:01.85 ksoftirqd/1
>    10 root      RT   0     0    0    0 S   0.0  0.0   0:01.21 watchdog/1
>    11 root      RT   0     0    0    0 S   0.0  0.0   0:00.50 migration/2
>    12 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/2
>    13 root      20   0     0    0    0 S   0.0  0.0   0:02.12 ksoftirqd/2
>    14 root      RT   0     0    0    0 S   0.0  0.0   0:01.06 watchdog/2
>    15 root      RT   0     0    0    0 S   0.0  0.0   0:01.81 migration/3
>    16 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/3
>    17 root      20   0     0    0    0 S   0.0  0.0   0:00.50 ksoftirqd/3
>    18 root      RT   0     0    0    0 S   0.0  0.0   0:00.87 watchdog/3
>    19 root      RT   0     0    0    0 S   0.0  0.0   0:01.76 migration/4
>    20 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/4
>    21 root      20   0     0    0    0 S   0.0  0.0   0:04.57 ksoftirqd/4
>    22 root      RT   0     0    0    0 S   0.0  0.0   0:01.03 watchdog/4
>    23 root      RT   0     0    0    0 S   0.0  0.0   0:00.51 migration/5
>    24 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/5
>    25 root      20   0     0    0    0 S   0.0  0.0   0:02.27 ksoftirqd/5
> [root at compute-1-4 ~]# umount -f /helga
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> [root at compute-1-4 ~]# umount -f /helga
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> [root at compute-1-4 ~]# mount -o remount /helga
> mount.nfs: Stale file handle
> [root at compute-1-4 ~]# mount -o remount /helga
> mount.nfs: Stale file handle
> [root at compute-1-4 ~]# top
>
> top - 16:33:19 up 14 days,  7:22,  1 user,  load average: 36.17, 35.76, 35.86
> Tasks: 229 total,   1 running, 228 sleeping,   0 stopped,   0 zombie
> Cpu(s): 99.6%us,  0.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  15928760k total,  8016408k used,  7912352k free,   238044k buffers
> Swap:  2047992k total,        0k used,  2047992k free,  1094888k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
>  8499 thelmuth  20   0 6089m 1.5g  11m S 209.2 10.1   2831:08 java
>  8397 thelmuth  20   0 6089m 1.5g  11m S 198.9 10.1   2789:37 java
>  8367 thelmuth  20   0 6089m 1.5g  11m S 196.2 10.0   2831:37 java
>  8428 thelmuth  20   0 6089m 1.5g  11m S 192.9 10.1   2811:28 java
>   764 root      20   0 15164 1352  952 R   0.7  0.0   0:00.03 top
>  1532 root      20   0     0    0    0 S   0.3  0.0  20:31.35 kondemand/2
>  1534 root      20   0     0    0    0 S   0.3  0.0  12:32.23 kondemand/4
>  1536 root      20   0     0    0    0 S   0.3  0.0  10:24.14 kondemand/6
>  1537 root      20   0     0    0    0 S   0.3  0.0  17:00.03 kondemand/7
>  3717 root      20   0  341m  12m 3000 S   0.3  0.1  38:24.92 python-bin
>     1 root      20   0 19344 1552 1232 S   0.0  0.0   0:00.46 init
>     2 root      20   0     0    0    0 S   0.0  0.0   0:00.02 kthreadd
>     3 root      RT   0     0    0    0 S   0.0  0.0   0:00.53 migration/0
>     4 root      20   0     0    0    0 S   0.0  0.0   0:02.85 ksoftirqd/0
>     5 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/0
>     6 root      RT   0     0    0    0 S   0.0  0.0   0:01.50 watchdog/0
>     7 root      RT   0     0    0    0 S   0.0  0.0   0:00.49 migration/1
>     8 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/1
>     9 root      20   0     0    0    0 S   0.0  0.0   0:01.85 ksoftirqd/1
>    10 root      RT   0     0    0    0 S   0.0  0.0   0:01.21 watchdog/1
>    11 root      RT   0     0    0    0 S   0.0  0.0   0:00.50 migration/2
>    12 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/2
>    13 root      20   0     0    0    0 S   0.0  0.0   0:02.12 ksoftirqd/2
>    14 root      RT   0     0    0    0 S   0.0  0.0   0:01.06 watchdog/2
>    15 root      RT   0     0    0    0 S   0.0  0.0   0:01.81 migration/3
>    16 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/3
>    17 root      20   0     0    0    0 S   0.0  0.0   0:00.50 ksoftirqd/3
>    18 root      RT   0     0    0    0 S   0.0  0.0   0:00.87 watchdog/3
>    19 root      RT   0     0    0    0 S   0.0  0.0   0:01.76 migration/4
>    20 root      RT   0     0    0    0 S   0.0  0.0   0:00.00 migration/4
>    21 root      20   0     0    0    0 S   0.0  0.0   0:04.57 ksoftirqd/4
> [root at compute-1-4 ~]# umount -f /helga
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> [root at compute-1-4 ~]# mount /helga
> mount.nfs: Stale file handle
> [root at compute-1-4 ~]# logout
> Connection to compute-1-4 closed.
> [root at fly install]# ssh compute-2-17
> Rocks Compute Node
> Rocks 6.1.1 (Sand Boa)
> Profile built 23:42 25-Dec-2014
>
> Kickstarted 23:58 25-Dec-2014
> [root at compute-2-17 ~]# ls /helga
> ls: cannot access /helga: Stale file handle
> [root at compute-2-17 ~]# umount -f /helga
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> [root at compute-2-17 ~]# lsof | grep /helga
> lsof: WARNING: can't stat() nfs4 file system /helga
>       Output information may be incomplete.
> python-bi 28615      root    8w  unknown  /helga/public_html/tractor/cmd-logs/hz12/J1504210040/T3.log (stat: Stale file handle)
> [root at compute-2-17 ~]#
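[Editorial aside: the manual `ls /helga` probe used on each node above can be wrapped in a small per-mount check. A hedged sketch, run locally here; in practice each call would be wrapped in `ssh <node> ...`. `check_mount` is a name invented for this demo, and `ls` fails the same way on a missing path as on a stale handle, which is what the second call simulates.]

```shell
#!/bin/sh
# check_mount: report whether a path can be listed, mirroring the manual
# "ls /helga" probe; any stat failure (ESTALE, ENOENT, ...) counts as bad.
check_mount() {
    if ls "$1" >/dev/null 2>&1; then
        echo "$1: ok"
    else
        echo "$1: stale or unreachable"
    fi
}

check_mount /tmp           # a healthy local path -> "/tmp: ok"
check_mount /no/such/dir   # simulated failure -> "/no/such/dir: stale or unreachable"
```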
>
>
>
> On 6/22/15 4:33 PM, Wm. Josiah Erikson wrote:
>
> Uh... it shouldn't have been. I wonder if it was because I yanked its
> logfile location out from under it? I did not expect that to happen! It's
> back now, but it looks like all of your runs errored out. I'm so sorry!
>     -Josiah
>
>
> On 6/22/15 4:00 PM, Thomas Helmuth wrote:
>
>  Very cool!
>
>  It looks like tractor is down. Is that related to this move?
>
>  Tom
>
> On Mon, Jun 22, 2015 at 1:50 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>
>> Keg will be going down shortly, and coming back up soon after as a
>> harder, faster, better, stronger version, I hope :)
>>     -Josiah
>>
>>
>> On 6/11/15 9:53 AM, Wm. Josiah Erikson wrote:
>> > Hi all,
>> >     I'm pretty sure there is 100% overlap between people who care
>> > about fly and people who care about keg at this point (though there
>> > are some people who care about fly and not so much keg, like Lee and
>> > Tom - sorry), so I'm sending this to this list.
>> >     I have a new keg all ready to go: a 32TB 14+2 RAID6 with 24GB of
>> > RAM, and it's superfast (way faster than gigabit)! I would like to
>> > bring it up on Monday, June 22nd. It would be ideal if rendering was
>> > NOT happening at that time, to make my rsyncing life easier :) Any
>> > objections?
>> >     -Josiah
>> >
>>
>> --
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>>  _______________________________________________
>> Clusterusers mailing list
>> Clusterusers at lists.hampshire.edu
>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>>
>
>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>
>