Lustre - issues with simple setup

Posted by ethrbunny on Server Fault, 2012-09-14.
Issue: I'm trying to assess the possible use of Lustre for our group. To that end I've been building a simple system to explore the nuances. The single-node 'llmount.sh' test passes, but I can't get beyond it to a working multi-node setup with any degree of success.

What I've done: Each system (a throwaway PC with a 70GB HDD and 2GB RAM) is installed with CentOS 6.2. I then update everything, install the Lustre kernel from downloads.whamcloud.com, and add the various (appropriate) lustre and e2fsprogs RPMs. Systems are rebooted and tested with 'llmount.sh' (and then cleaned up with 'llmountcleanup.sh'). All is well to this point.
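Roughly, the per-node sequence is this (a sketch from memory; the exact RPM file names depend on the release pulled from downloads.whamcloud.com, and the test scripts come from the lustre-tests package):

yum -y update
rpm -ivh kernel-*lustre*.rpm                 # patched Lustre kernel
rpm -Uvh lustre-*.rpm e2fsprogs-*.rpm        # lustre, lustre-modules, lustre-tests, e2fsprogs
reboot
/usr/lib64/lustre/tests/llmount.sh           # single-node sanity test
/usr/lib64/lustre/tests/llmountcleanup.sh    # tear the test setup back down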

First I create an MDS/MDT system (which also acts as the MGS) via:

/usr/sbin/mkfs.lustre --mgs --mdt --fsname=lustre --device-size=200000 --param sys.timeout=20 --mountfsoptions=errors=remount-ro,user_xattr,acl --param lov.stripesize=1048576 --param lov.stripecount=0 --param mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype ldiskfs --reformat /tmp/lustre-mdt1

and then

mkdir -p /mnt/mds1    
mount -t lustre -o loop,user_xattr,acl  /tmp/lustre-mdt1 /mnt/mds1

Next I take 3 systems and create a 2Gb loop mount on each via:

/usr/sbin/mkfs.lustre --ost --fsname=lustre --device-size=200000 --param sys.timeout=20 --mgsnode=lustre_MDS0@tcp --backfstype ldiskfs --reformat /tmp/lustre-ost1   


mkdir -p /mnt/ost1     
mount -t lustre -o loop  /tmp/lustre-ost1 /mnt/ost1    

The logs on the MDT box show the OSS boxes connecting up. All appears ok.
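(For reference, this is roughly how I'm checking on the MDS side; the device names are whatever lctl reports:)

lctl dl                                   # on the MDS: the MGS, the MDT, and one entry per connected OST
grep -i lustre /var/log/messages | tail   # OST connection / recovery messages land here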

Last I create a client and attach to the MDT box:

mkdir -p /mnt/lustre
mount -t lustre -o user_xattr,acl,flock luster_MDS0@tcp:/lustre /mnt/lustre    

Again, the log on the MDT box shows the client connection. Appears to be successful.

Here's where the issues (appear to) start. If I do a 'df -h' on the client it hangs after showing the system drives. If I attempt to create files (via 'dd') on the lustre mount the session hangs and the job can't be killed. Rebooting the client is the only solution.
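For reference, the sort of commands that hang (the file name and sizes here are just examples):

df -h                                                     # hangs after listing the local drives
dd if=/dev/zero of=/mnt/lustre/test.dat bs=1M count=100   # never returns and can't be killed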

If I do a 'lctl dl' from the client it shows that only 2 of the 3 OST boxes are found and 'UP'.

[root@lfsclient0 etc]# lctl dl   
0 UP mgc MGC10.127.24.42@tcp 282d249f-fcb2-b90f-8c4e-2f1415485410 5   
1 UP lov lustre-clilov-ffff880037e4d400 00fc176e-3156-0490-44e1-da911be9f9df 4   
2 UP lmv lustre-clilmv-ffff880037e4d400 00fc176e-3156-0490-44e1-da911be9f9df 4   
3 UP mdc lustre-MDT0000-mdc-ffff880037e4d400 00fc176e-3156-0490-44e1-da911be9f9df 5   
4 UP osc lustre-OST0000-osc-ffff880037e4d400 00fc176e-3156-0490-44e1-da911be9f9df 5   
5 UP osc lustre-OST0003-osc-ffff880037e4d400 00fc176e-3156-0490-44e1-da911be9f9df 5   

Doing an 'lfs df' from the client shows:

[root@lfsclient0 etc]# lfs df  
UUID                   1K-blocks        Used   Available Use% Mounted on  
lustre-MDT0000_UUID       149944       16900      123044  12% /mnt/lustre[MDT:0]  
OST0000             : inactive device  
OST0001             : Resource temporarily unavailable  
OST0002             : Resource temporarily unavailable  
lustre-OST0003_UUID       187464       24764      152636  14% /mnt/lustre[OST:3]  

filesystem summary:       187464       24764      152636  14% /mnt/lustre  

Given that each OSS box has a 2Gb (loop) mount, I would expect to see this reflected in the available size.

There are no errors on the MDS/MDT box to indicate that multiple OSS/OST boxes have been lost.

EDIT: Each system has all the other systems defined in /etc/hosts, plus iptables entries to allow access.
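For example, this is the shape of what I mean (addresses and hostnames other than the MDS entry are made up for illustration; 988 is the standard LNET TCP port):

# /etc/hosts on every node
10.127.24.42   lustre_MDS0
10.127.24.50   lfsclient0
10.127.24.51   lfsost0

# iptables on every node
iptables -I INPUT -p tcp --dport 988 -j ACCEPT
service iptables save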

So: I'm clearly making several mistakes. Any pointers on where to start correcting them?
