
Data Cleaning IS Analysis, Not Grunt Work.

First, let’s state the problem with existing writing on “Data Cleaning”. Wikipedia's article on data cleaning gives a decent summary of the big, important dimensions of data quality: Validity, Accuracy, Completeness, Consistency, Uniformity. It’s also got a section on “process” that’s really dry and academic (in a negative way) and won’t help you clean any data at all. Next, I’m just gonna sample posts from the top links on Google when I search “Data cleaning”. I’ll provide links as reference so you know what I’m griping about. This highly PageRanked one starts out like a friendlier expansion of the Wikipedia page. Luckily, it redeems itself in the process section with a big list of example techniques for cleaning data: things like cleaning spaces, dropping irrelevant values, etc. It even has some examples and illustrations!
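For a sense of what those listed techniques look like in practice, here is a minimal sketch of two of them, cleaning spaces and dropping irrelevant values, in plain Python. The function name `clean_rows`, the `is_relevant` predicate, and the sample records are all made up for illustration; none of it comes from the posts being discussed.

```python
def clean_rows(rows, is_relevant):
    """Tidy whitespace in string fields, then keep only the relevant rows."""
    cleaned = []
    for row in rows:
        # Strip leading/trailing spaces and collapse internal runs of whitespace.
        tidy = {
            key: " ".join(value.split()) if isinstance(value, str) else value
            for key, value in row.items()
        }
        # "Irrelevant" is whatever the analysis says it is, so the caller decides.
        if is_relevant(tidy):
            cleaned.append(tidy)
    return cleaned


# Hypothetical sample data: one real customer, one internal test account.
raw = [
    {"name": "  Ada   Lovelace ", "account_type": "customer "},
    {"name": "QA Bot", "account_type": " test"},
]

result = clean_rows(raw, is_relevant=lambda row: row["account_type"] != "test")
print(result)
```

Note that the whitespace cleanup has to happen before the relevance check here: the raw value `" test"` would otherwise slip past a comparison against `"test"`. Deciding what counts as irrelevant is already a judgment about the analysis, not just mechanical tidying.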

