Initially this took about 7 hours to diagnose and fix. With what I have learned about the inner workings of gluster, and the tools I am releasing as open source, this should cut resolution time down to about 5 minutes.

First you must meet the following conditions:

  1. You are running gluster >= 3.0 and <= 3.2 (it may also work on 2.x, which I have not tested, and it will not work with future versions if gluster changes its use of xattrs)
  2. You are running a replicated volume (again, I have not tested distributed volumes; in theory remove, re-add and rebalance will fix these)
  3. You have a “good” copy of your data (this is essential: the procedure assumes you have at least 1 brick with a good copy of the file system)

Restrain and restore the “bad” brick

  1. Shut down all services that are using the mounted filesystem (e.g. httpd / nginx / *ftpd)
  2. Unmount all the file systems on the node (glusterfs / nfs / etc …)
  3. Grab a copy of stripxattr.py and make sure you READ the README for installation requirements and usage
  4. Run stripxattr.py against the backing filesystem on the “bad” node ONLY, NOT AGAINST A GLUSTER MOUNT
  5. From the “good” node, now rsync the data: rsync -gioprtv --progress /path/to/filesystem root@:/path/to
  6. From the “good” node, trigger an “auto heal”; this will re-populate the xattr data (this must be done on a glusterfs mount, not nfs/cifs/etc…)
  7. Download listxattr.py; once the self heal has completed, see the README file for a “quick and dirty” consistency check
  8. All being well you have now resolved a split-brain and can return your node to service
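
To make steps 4, 6 and 7 more concrete, here is a minimal Python sketch of the same ideas. It is not stripxattr.py or listxattr.py, just an illustration under stated assumptions: that gluster 3.x keeps its state in xattrs starting with trusted.gfid, trusted.afr and trusted.glusterfs, that a stat of every path on a glusterfs mount is enough to trigger the self heal, and that a trusted.afr.* value is 12 bytes holding three big-endian 32-bit counters (data, metadata, entry). All function names are hypothetical.

```python
import os
import struct

# Assumption: these prefixes cover the xattrs gluster 3.x sets on a brick.
GLUSTER_PREFIXES = ("trusted.gfid", "trusted.afr", "trusted.glusterfs")


def gluster_xattrs(names, prefixes=GLUSTER_PREFIXES):
    """Return only the xattr names that look like they belong to gluster."""
    return [name for name in names if name.startswith(prefixes)]


def walk_paths(root):
    """Yield root itself, then every directory and file beneath it."""
    yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def strip_xattrs(brick_root):
    """Step 4: remove gluster xattrs from a backing filesystem.

    Linux only, and reading or removing trusted.* xattrs needs root.
    Never point this at a gluster mount.
    """
    removed = 0
    for path in walk_paths(brick_root):
        names = os.listxattr(path, follow_symlinks=False)
        for name in gluster_xattrs(names):
            os.removexattr(path, name, follow_symlinks=False)
            removed += 1
    return removed


def trigger_self_heal(mount_root):
    """Step 6: stat every path on a glusterfs mount so each one is looked up."""
    count = 0
    for path in walk_paths(mount_root):
        os.lstat(path)
        count += 1
    return count


def pending_counts(value):
    """Step 7: decode a 12-byte trusted.afr.* value into its three counters."""
    data, metadata, entry = struct.unpack(">III", value)
    return {"data": data, "metadata": metadata, "entry": entry}
```

With this sketch you would run strip_xattrs as root against the brick directory on the “bad” node only, run trigger_self_heal against the glusterfs mount on the “good” node, and afterwards treat any non-zero counter from pending_counts as a sign that a file still has pending changes.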

Current known gluster issues

  1. NFS is much faster (48x in tests) for small files, e.g. php webapps, but does not support distributed locking, meaning all nodes can write to the same file at the same time; this is what caused our original split brain

So what is the resolution in this case?

Selective use: use glusterfs for filesystems where you need distributed locking. In large production deploys php files often will not change much, and in that case NFS is perfect.
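
As a rough illustration of what “needing distributed locking” looks like from the application side, here is a small Python sketch of a writer that takes an exclusive POSIX advisory lock before appending to a shared file. The assumption is that on a native glusterfs mount such locks are arbitrated across clients, while on the NFS mounts described above they are not, so the lock would only protect writers on the same node. The function name locked_append is hypothetical.

```python
import fcntl
import os


def locked_append(path, text):
    """Append text to path while holding an exclusive fcntl lock on it."""
    with open(path, "a") as handle:
        # Blocks until no other process holds a conflicting lock.
        fcntl.lockf(handle, fcntl.LOCK_EX)
        try:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the data out before unlocking
        finally:
            fcntl.lockf(handle, fcntl.LOCK_UN)
```

Files that are written this way from several nodes belong on the glusterfs mount; files that are only ever read, such as deployed php code, are the candidates for NFS.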

If you are still writing php sessions to a file system then STOP IT and use a database! (Better yet use memcache).

  • http://joejulian.name Joe Julian

    The lack of nfs locking support did not cause your split-brain situation. It would have equally corrupted all replicas. If it happens again, check your logs for disconnections, then come see us in #gluster and we’ll see if we can help you diagnose it.

    Btw, I wouldn’t recommend the rsync method. You’ll get unpredictable results, especially with a distributed replica. If you really want to just resync the entire brick, make sure glusterfsd isn’t running for that brick and wipe it. Then start the brick again (gluster volume start {volname} force) then perform the normal self-heal process.

    Really, though, you only need to remove one copy of an offending split-brained file. To help identify files with pending self-heals, this tool can be handy: http://joejulian.name/blog/quick-and-dirty-python-script-to-check-the-dirty-status-of-files-in-a-glusterfs-brick/

  • http://www.saiweb.co.uk Buzz

    In this case the lack of distributed locking was the cause: multiple subversion metafiles were all being updated on each node independently, leading to the split-brain.

    I’m in no way blaming Gluster here, for one I loathe the fact SVN does this, but for legacy deploys I have to use it.

    The rsync method is meant for when you know you have a “known good copy”, much the same scenario as a DRBD split brain. Self heal was not working, and neither was rsync with glusterd off (the xattrs persisted).

    Now this post is a late write up. Ironically this occurred during the transition to red hat, meaning … ALL the docs were offline as per my tweets, and I only came across the xattrs after experimenting with reading xattrs for another project.
    Also I did drop by IRC under the handle Oneiroi, to no avail. That could have been due to the red hat transition or timezone, but this was the one and only solution to the issue I was having.

    Thanks for the script link. Could you elaborate a bit on what identifies the trusted.afr as “dirty”?