File Sync Solution for Batch Processing (ETL)

I'm looking for a slightly different kind of sync utility - not one designed to keep two directories identical, but rather one intended to keep files flowing from one host to another.

The context is a data warehouse that currently has a custom-developed solution that moves 10,000 files a day, some of them 1+ GB gzipped files, between Linux servers via ssh. Files are produced by the extract process, then moved to the transform server, where a transform daemon is waiting to pick them up. The same hand-off happens between transform and load. Once the files are moved, they are typically archived on the source for a week, and the downstream process likewise moves them to a temp directory and then to an archive as it consumes them. So, my requirements & desires:

  • It will never be used to refresh updated files, only to deliver new files.
  • Because it delivers files to downstream processes, it needs to rename each file once the transfer is complete so that a partial file never gets picked up (see the sketch after this list).
  • To simplify recovery, it should keep a copy of the source files, but rename them or move them to another directory.
  • If a transfer fails (network down, file system full, permissions, file locked, etc.), it should retry periodically, and never fail in a non-recoverable way, never send a file twice, and never drop a file.
  • It should be able to copy files to 2+ destinations.
  • It should have a consolidated log so that it's easy to find problems.
  • It should have an optional checksum feature.
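
To make the delivery semantics concrete, here's a rough sketch in Python of the pattern I'm describing: copy under a temporary name, optionally verify a checksum, then rename atomically so the downstream daemon never sees a partial file, with periodic retries and the source kept in an archive directory. All paths and names here are made up for illustration, and it ignores the ssh transport (it just shows a local/NFS-style copy). I'd much rather find an existing tool that already does this robustly than keep maintaining something like this ourselves.

```python
# Sketch only: hypothetical paths/names, local copy standing in for the ssh transfer.
import hashlib
import shutil
import time
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum a file in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def deliver(src: Path, dest_dir: Path, archive_dir: Path,
            retries: int = 10, delay: int = 60) -> None:
    tmp = dest_dir / (src.name + ".part")    # invisible to the pickup daemon
    final = dest_dir / src.name
    for attempt in range(1, retries + 1):
        try:
            shutil.copy2(src, tmp)
            if sha256(tmp) != sha256(src):   # optional checksum requirement
                raise IOError("checksum mismatch after copy")
            tmp.rename(final)                # atomic rename: partial files never visible
            shutil.move(str(src), archive_dir / src.name)  # keep source copy for recovery
            return
        except OSError as exc:
            print(f"attempt {attempt} failed for {src.name}: {exc}")
            time.sleep(delay)                # retry: network down, disk full, locked file, ...
    # A real tool should keep the file queued rather than give up here.
    raise RuntimeError(f"gave up delivering {src.name} after {retries} attempts")
```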

Any recommendations? Can Unison do this well?
