What are best practices for collecting, maintaining and ensuring accuracy of a huge data set?

Posted by Kyle West on Stack Overflow
Published on 2010-12-22T01:38:05Z Indexed on 2010/12/22 1:54 UTC

Filed under: database, data, management

I am posing this question looking for practical advice on how to design a system.

Sites like Amazon.com and Pandora build and maintain huge data sets to run their core business. For example, Amazon (like every other major e-commerce site) has millions of products for sale, along with images of those products, pricing, specifications, and so on.

Ignoring the data coming in from third-party sellers and the user-generated content, all that "stuff" had to come from somewhere and is maintained by someone. It's also incredibly detailed and accurate. How do they do it? Is there just an army of data-entry clerks, or have they devised systems to handle the grunt work?

My company is in a similar situation. We maintain a huge catalog (tens of millions of records) of automotive parts and the cars they fit. We've been at it for a while now and have come up with a number of programs and processes to keep our catalog growing and accurate; however, it seems that to grow the catalog to x items, we need to grow the team to y.

I need to figure out some ways to increase the efficiency of the data team, and hopefully I can learn from the work of others. Any suggestions are appreciated; even more welcome would be links to content I could spend some serious time reading.

THANKS!

Kyle
