Data cleaning~ My specialty. The gift that God gave me.
I did not know that I had data cleaning skills. But after many years of cleaning data and helping others with their data, I am very thankful that I have some kind of gut feeling for data.
Today, like always, I was cleaning data.
And I found errors in the data that we bought! More than a few.
These errors could easily have been caught by merging with another data set.
Then there were errors with missing identifiers. This was interesting. My conjecture is that many students or other people worked on this data set, and somewhere along the way some things got lost. Anyhow, now I am filling in the data and hopefully, when the time is right, I can share the data.
So after many years of data cleaning, here are some things that I learned.
- Always check the data. Just because everyone uses it does not mean the data is clean. A lot of the time, people do not check the data and put the responsibility on the data provider. That does not seem to be the best way to handle things. It is like a chef saying, “I don’t know anything about the ingredients, but I cook really well.” Yeah…
- Do sanity checks. This is simple and quick. Check for missing values. Check for extreme values. Look at the distribution. For financial firms, see what types of firms are covered and what types of firms are missing. Do the missing firms create any bias?
- Link with the source. Not all data can be linked back to the source, but a lot of financial data can be checked against SEC filings or with simple googling. Check! Check!
- Always save the original file. This is really weird, but sometimes people do not save the original. Always save the original so you can compare after cleaning, and so you can fall back on it in case you mess up while cleaning the data.
- Document! Make sure to record how you cleaned the data: what observations were added or excluded, how extreme values were handled… Never skimp on the details.
- Have someone else check the data too. Data cleaning is hard, and it is easy to make mistakes. Have others check your data as well.
- Cockroach rule! While checking the data, if you find an error, check again. Check more data. Do this until you don’t find any more errors. Usually, one error is not random. It tends to be systematic. For instance, I am matching company names to SEC filings, and I find that one of the guys who reported the data to the government is not a good speller….
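
To make a few of these rules concrete, here is a minimal Python sketch of the sanity checks, the merge against a second data set, and the hunt for misspelled company names. It is only an illustration, not the actual cleaning code I use. The firm names, the field names (`firm_id`, `name`, `assets`), the function names, and the thresholds (`k=5.0`, `cutoff=0.8`) are all made up for the example. Extreme values are flagged with a median absolute deviation rule, which is one reasonable choice among many.

```python
from difflib import get_close_matches
from statistics import median


def sanity_report(rows, id_field, value_field, k=5.0):
    """Return row indexes with missing ids, missing values, and extreme values.

    A value is "extreme" if it lies more than k median absolute deviations
    (MAD) from the median. k=5.0 is an arbitrary, illustrative threshold.
    """
    missing_ids = [i for i, r in enumerate(rows) if not r.get(id_field)]
    missing_values = [i for i, r in enumerate(rows) if r.get(value_field) is None]
    values = [r[value_field] for r in rows if r.get(value_field) is not None]
    extremes = []
    if values:
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad > 0:
            extremes = [
                i for i, r in enumerate(rows)
                if r.get(value_field) is not None
                and abs(r[value_field] - med) > k * mad
            ]
    return {
        "missing_ids": missing_ids,
        "missing_values": missing_values,
        "extremes": extremes,
    }


def cross_check(rows, reference, id_field, value_field):
    """Merge on id_field and return (id, our_value, reference_value) where they differ."""
    ref = {r[id_field]: r[value_field] for r in reference}
    return [
        (r[id_field], r.get(value_field), ref[r[id_field]])
        for r in rows
        if r.get(id_field) in ref and r.get(value_field) != ref[r[id_field]]
    ]


def suggest_names(names, official_names, cutoff=0.8):
    """Map each name missing from the official list to its closest official spelling."""
    official = list(official_names)
    fixes = {}
    for name in names:
        if name in official:
            continue
        match = get_close_matches(name, official, n=1, cutoff=cutoff)
        if match:
            fixes[name] = match[0]
    return fixes


# Made-up sample data, purely for illustration.
rows = [
    {"firm_id": "A1", "name": "Acme Corp", "assets": 100.0},
    {"firm_id": "B2", "name": "Bolt Industries", "assets": 120.0},
    {"firm_id": "", "name": "Cedar Holdings", "assets": 110.0},
    {"firm_id": "D4", "name": "Delta Finacial", "assets": None},
    {"firm_id": "E5", "name": "Echo Partners", "assets": 9000.0},
    {"firm_id": "F6", "name": "Fox Capital", "assets": 105.0},
]
reference = [
    {"firm_id": "A1", "assets": 100.0},
    {"firm_id": "B2", "assets": 125.0},
]
official_names = [
    "Acme Corp", "Bolt Industries", "Cedar Holdings",
    "Delta Financial", "Echo Partners", "Fox Capital",
]

report = sanity_report(rows, "firm_id", "assets")
mismatches = cross_check(rows, reference, "firm_id", "assets")
fixes = suggest_names([r["name"] for r in rows], official_names)

print(report)
print(mismatches)
print(fixes)
```

None of this replaces looking at the data yourself. The fuzzy name matches in particular are only suggestions, so each one should still be checked against the source before you change anything.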
Hope these seven rules are helpful! Also, remember that sharing is caring, so please do share your data. It helps advance science and saves a lot of money and energy.
PS
I plan to start sharing data as soon as I wrap up my projects 😀