Let’s start by stating the problem with existing writing on “Data Cleaning”.
Wikipedia's article on data cleaning does a decent job of summarizing the big, important dimensions of data quality: Validity, Accuracy, Completeness, Consistency, Uniformity. It’s also got a section on “process” that’s really dry and academic (in a negative way) and won’t help you clean any data at all.
Next I’m just gonna sample posts from the top links on Google when I search “Data cleaning”. I’ll provide links as reference so you know what I’m griping about.
This highly PageRanked one reads like a friendlier expansion of the Wikipedia page at the start. Luckily it redeems itself in the process section with a big list of example techniques for cleaning data: things like cleaning spaces, dropping irrelevant values, etc. It even has some examples and illustrations!
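To make two of those techniques concrete, here's a rough sketch in plain Python of trimming stray spaces and dropping irrelevant values. The sample rows, the `clean_rows` helper, and the list of placeholder values are all made up for illustration; none of it comes from the linked post.

```python
def clean_rows(rows, irrelevant=("", "n/a", "unknown")):
    """Strip surrounding whitespace from string fields, then drop rows
    whose 'city' value is empty or a known placeholder.

    The field name 'city' and the placeholder list are assumptions
    chosen for this example only.
    """
    cleaned = []
    for row in rows:
        # Trim leading/trailing spaces on every string field.
        trimmed = {k: v.strip() if isinstance(v, str) else v
                   for k, v in row.items()}
        # Treat a missing or None city as empty, compare case-insensitively.
        city = str(trimmed.get("city") or "").lower()
        if city in irrelevant:
            continue  # irrelevant value: skip the whole row
        cleaned.append(trimmed)
    return cleaned


rows = [
    {"name": "  Ada ", "city": "London "},
    {"name": "Bob", "city": " N/A"},
    {"name": "Cy ", "city": ""},
]
print(clean_rows(rows))  # [{'name': 'Ada', 'city': 'London'}]
```

Trimming happens before the drop check on purpose: otherwise a value like `" N/A"` would slip past the placeholder comparison.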
The standard Perl distribution comes with a debugger, although it's really just another Perl program, perl5db.pl. Since it is just a program, I can use it as the basis for writing my own debuggers to suit my needs, or I can use the interface perl5db.pl provides to configure its actions. That's just the beginning, though.