make-like build tools for data?

Posted by miku on Programmers, 2012-11-23

Make is a standard tool for building software, but make decides whether a target needs to be regenerated by comparing file modification times.

Are there any proven, preferably small tools that handle builds not for software but for data? Something that regenerates targets not only based on modification times but also on other properties (e.g. completeness). (Or, alternatively, a paper that describes such a tool.)
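
To make that concrete, here is a rough Python sketch (not a real tool; the names Rule, older_than and incomplete_vs are made up) of what I have in mind: a make-like rule whose rebuild test is a pluggable predicate instead of a fixed mtime comparison.

    import os

    class Rule:
        def __init__(self, target, action, is_stale):
            self.target = target      # path of the artifact this rule produces
            self.action = action      # callable(target) that (re)generates it
            self.is_stale = is_stale  # callable(target) -> bool, the rebuild signal

        def run(self):
            if self.is_stale(self.target):
                self.action(self.target)

    # Classic make behaviour: rebuild if the target is missing or older than its source.
    def older_than(source):
        return lambda target: (not os.path.exists(target)
                               or os.path.getmtime(target) < os.path.getmtime(source))

    # A "data" signal: rebuild if the target is missing or incomplete; here
    # "incomplete" is crudely approximated by "fewer lines than the source".
    def incomplete_vs(source):
        def check(target):
            if not os.path.exists(target):
                return True
            with open(source) as s, open(target) as t:
                return sum(1 for _ in t) < sum(1 for _ in s)
        return check

    # Usage (hypothetical): Rule("out.csv", convert, is_stale=incomplete_vs("in.xml")).run()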

As an illustration, I'd like to automate the following process (a rough sketch of these steps as a pipeline follows the list):

  • get data (e.g. a tarball) from some regularly updated source
  • copy it somewhere if it's not already there (based e.g. on some filename scheme)
  • convert the files to a different format (but only if there aren't successfully converted ones there already, e.g. from a previous attempt; custom comparison routine)
  • for each file, find a certain data element and fetch some additional file from, say, a URL, but only if that hasn't been downloaded yet (decided by file existence and file "freshness")
  • finally compute something (e.g. a word count for something identifiable) and store it in the database, but only if the DB does not have an entry for that exact ID yet
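
As promised, here is roughly how I picture those steps wired together; every URL, path, table name and helper below is a hypothetical placeholder, and each step carries its own "already done?" check so a rerun only redoes the missing parts.

    import os
    import sqlite3
    import urllib.request

    RAW  = "data/source.tar.gz"    # hypothetical paths
    CONV = "data/converted"
    DB   = "data/results.db"

    def fetch_tarball():
        # steps 1+2: get the tarball and put it where it belongs
        os.makedirs("data", exist_ok=True)
        urllib.request.urlretrieve("http://example.org/dump.tar.gz", RAW)

    def tarball_present():
        return os.path.exists(RAW)

    def convert_files():
        # step 3: the actual conversion is left out here
        pass

    def conversion_ok():
        # stand-in for the custom comparison routine: "some converted output exists"
        return os.path.isdir(CONV) and len(os.listdir(CONV)) > 0

    def fetch_extras():
        # step 4: per-file extra downloads, omitted for brevity
        pass

    def extras_fresh():
        # would combine file existence and a "freshness" test; stubbed out
        return True

    def compute_counts():
        # step 5: compute e.g. a word count and store it under a known ID
        os.makedirs("data", exist_ok=True)
        with sqlite3.connect(DB) as db:
            db.execute("CREATE TABLE IF NOT EXISTS counts (id TEXT PRIMARY KEY, n INT)")
            db.execute("INSERT OR IGNORE INTO counts VALUES (?, ?)", ("doc-1", 42))

    def counts_in_db():
        if not os.path.exists(DB):
            return False
        try:
            with sqlite3.connect(DB) as db:
                return db.execute("SELECT 1 FROM counts WHERE id = ?",
                                  ("doc-1",)).fetchone() is not None
        except sqlite3.OperationalError:  # table not created yet
            return False

    PIPELINE = [        # (skip-check, action) per stage
        (tarball_present, fetch_tarball),
        (conversion_ok,   convert_files),
        (extras_fresh,    fetch_extras),
        (counts_in_db,    compute_counts),
    ]

    def run():
        for done, step in PIPELINE:
            if not done():
                step()

    if __name__ == "__main__":
        run()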

Observations:

  • there are different stages
  • each stage is usually simple to compute or implement in isolation
  • each stage may be simple, but the data volume may be large
  • each stage may produce a few errors
  • each stage may have different signals for when (re)processing is needed
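
The last point is the one I find most interesting: each stage's (re)processing decision is really just a predicate, and predicates for different signals could be combined, roughly like this (again, all names are made up):

    import hashlib
    import os
    import time

    def missing(path):
        return lambda: not os.path.exists(path)

    def older_than_days(path, days):
        # "freshness": fires if the file exists but is older than `days` days
        return lambda: (os.path.exists(path)
                        and time.time() - os.path.getmtime(path) > days * 86400)

    def checksum_differs(path, expected_sha1):
        def check():
            if not os.path.exists(path):
                return True
            with open(path, "rb") as f:
                return hashlib.sha1(f.read()).hexdigest() != expected_sha1
        return check

    def any_of(*signals):
        # reprocess the stage if any of its signals fires
        return lambda: any(s() for s in signals)

    # e.g. stage 4 above: refetch if the extra file is missing or older than a week
    refetch_extra = any_of(missing("data/extra.json"),
                           older_than_days("data/extra.json", 7))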

Requirements:

  • builds should be interruptible and idempotent (== robust)
  • when interrupted, already processed objects should be reused to speed up the next run
  • data paths should be easy to adjust (simple syntax, nothing new to learn, an internal DSL would be OK)
  • some form of dependency graph that describes the process would be nice for later visualizations (see the sketch after this list)
  • should leverage existing programs, if possible
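
For the resumability and dependency-graph points, I imagine something roughly like the following (illustrative only, not a concrete proposal): the graph is plain data, so it could later be fed to a visualizer, and every finished target leaves a marker so an interrupted run can pick up where it left off.

    import os

    # target -> (dependencies, action); everything here is hypothetical
    GRAPH = {
        "raw":       ([],                      lambda: print("fetch tarball")),
        "converted": (["raw"],                 lambda: print("convert files")),
        "extras":    (["converted"],           lambda: print("fetch extra files")),
        "counts":    (["converted", "extras"], lambda: print("compute word counts")),
    }

    def done_marker(target):
        return os.path.join("state", target + ".done")

    def build(target, graph=GRAPH):
        deps, action = graph[target]
        for dep in deps:                          # walk the graph depth-first
            build(dep, graph)
        if os.path.exists(done_marker(target)):
            return                                # built in an earlier (interrupted) run
        action()
        os.makedirs("state", exist_ok=True)
        open(done_marker(target), "w").close()    # record completion for the next run

    # the same GRAPH dict could also be dumped as Graphviz "dot" for visualization
    build("counts")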

I've done some research on make alternatives like Rake, and have worked a lot with Ant and Maven in the past. All these tools naturally focus on code and software builds, not on data builds. The system we have in place now for a task similar to the above is pretty much just shell scripts, which are compact (and OK glue for a variety of programs written in other languages), so I wonder if worse is better?
