Module 30 Cleaning messy data
- This is a review exercise: practice the skills introduced in the previous modules by applying them to a universal data science scenario, cleaning up messy data.
In the module on joining datasets, we introduced a dataset of whale diving behaviour:
head(dives)
           id species behavior prey.volume prey.depth dive.time surface.time
1 20140811106      HW     FEED    6.914610     120.76    351.00          237
2 20140812104      HW     FEED    7.854762      79.02    281.00           87
3 20140812107      HW     FEED    7.385667      96.92    300.25           80
4 20140812109      FW     FEED    6.626298     105.87    366.00          189
5 20140812131      HW    OTHER    6.356474     123.95    357.00          112
6 20140812140      FW     FEED    3.820782     125.51    408.00          182
  blow.interval blow.number
1        26.833      10.000
2        14.412       6.667
3        16.000       6.000
4        16.273      12.000
5        25.250       6.000
6        18.789      11.000
This dataframe is nice and tidy. Here are a few of its many tidy features:
Each row is a single observation.
There is not a single missing value anywhere in the dataset.
The rows are organized from earliest to most recent, based on the date embedded in the id column.
Categorical columns have standardized formatting. In the species column, there are two levels: HW (humpback whale) and FW (fin whale). In the behavior column, there are also two levels: FEED and OTHER.
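Each of the tidy features listed above can be checked with a line or two of base R. The sketch below is only an illustration: it assumes a small data frame called dives_head (a hypothetical name) retyped by hand from the six rows printed above, with a subset of the columns and with id stored as a number, which may not match how the real file stores it. On the real data you would run the same checks on dives.

```r
# Hypothetical stand-in for dives: the six rows shown by head(dives),
# retyped by hand with only four of the columns.
# Assumption: id is stored as a number (it may be text in the real data).
dives_head <- data.frame(
  id        = c(20140811106, 20140812104, 20140812107,
                20140812109, 20140812131, 20140812140),
  species   = c("HW", "HW", "HW", "FW", "HW", "FW"),
  behavior  = c("FEED", "FEED", "FEED", "FEED", "OTHER", "FEED"),
  dive.time = c(351.00, 281.00, 300.25, 366.00, 357.00, 408.00),
  stringsAsFactors = FALSE
)

# Feature: no missing values anywhere in the data frame
no_missing <- !any(is.na(dives_head))

# Feature: rows run from earliest to most recent, judged by id
is_sorted <- !is.unsorted(dives_head$id)

# Feature: categorical columns hold only the standardized levels
species_levels  <- sort(unique(dives_head$species))
behavior_levels <- sort(unique(dives_head$behavior))

no_missing
is_sorted
species_levels
behavior_levels
```

Checks like these are handy while cleaning: run them on your work in progress and see which ones still fail.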
But this dataset was not always so pretty. Here is the link to the original data file:
Your task in this review exercise is to write a script that carries out the necessary data cleaning steps to get this dataset from its original form to its tidy form.
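The specific problems in the original file are for you to discover, so the following is only a sketch of the general shape a cleaning script might take. It assumes a made-up data frame, raw, with three kinds of mess that are common in field data: inconsistent capitalization and stray whitespace in categorical columns, a row with a missing value, and rows out of chronological order. The column names match dives, but the messy values and the clean_dives() helper are hypothetical, and the real file may need different steps.

```r
# Hypothetical messy data; the real file may have different problems.
raw <- data.frame(
  id        = c(20140812104, 20140811106, 20140812107, 20140812109),
  species   = c("hw ", "HW", "Hw", "fw"),
  behavior  = c("feed", "FEED", " Feed", "other"),
  dive.time = c(281.00, 351.00, NA, 366.00),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one possible sequence of cleaning steps.
clean_dives <- function(df) {
  # 1. Standardize categorical columns: trim whitespace, force upper case
  df$species  <- toupper(trimws(df$species))
  df$behavior <- toupper(trimws(df$behavior))

  # 2. Drop rows that contain any missing value
  df <- df[complete.cases(df), ]

  # 3. Sort rows from earliest to most recent by id
  df <- df[order(df$id), ]

  # 4. Reset row names so they run 1, 2, 3, ... again
  rownames(df) <- NULL
  df
}

my_dives <- clean_dives(raw)
my_dives
```

Step 4 is easy to overlook. Subsetting and reordering leave the old row names attached, and a strict comparison of two data frames also compares their attributes, row names included.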
Test your work along the way, then demonstrate its completion using the identical() function. If your my_dives version of the dataset is identical to the dives data above, the following logical test will be