Step 1: Go to https://open.alberta.ca/opendata/deaths-cause-by-gender-and-age
Step 2: Download a file named deaths-by-gender-and-age00.csv by clicking the bottom download button or the included link.
Step 3: Open in a spreadsheet program like Excel. I opened it using OpenOffice, which is open source, so it’s free.
The first thing I saw that every thing was laid out neatly and it was easy to spot the system.
Columns run from A to Z. Perfect. Very Organized.
year, cause, gender, 19 age categories spanning 5 years each with the exception of the 90+ category, which covers everyone who makes it there.
Then there’s people who they didn’t know how old they were when they died,
whether they died at home or not
and a totals column.
Step 4: Looking to the rows, every year for every cause they’ve recorded 3 values for each cause of death.
- Male total
- female total
- Grandtotal
with each value recorded under the appropriate age column.
Step 5: But all is not as it seems. Note that the government had appended a category number to each cause name and separated it from the cause name with a space. Given that .csv in the filename stands for comma separated values, this means the creator of the file did not want the the ID# and the description of each cause to be separate.
Step 6: Find out why. This is why we downloaded the .csv version of the data so we can import it in a better database program for analysis. I imported the file into a mySql database that looks like this:
Step 7: Once it’s been imported into mySQL, you can start querying the data. First finding: There are 2290 “unique” category names. This can be reduced by just eliminating extra spaces, leaving us with 1903 unique causes of death.
Step 8: This is where you’ll realize that the cause ID#’s do not remain the same throughout the data.
Step 8: Separate the cause ID# from its name. This still leaves us with well over 1000 unique causes of death. Now we can start fixing some of the carelessness of those responsible for creating this document.
This reduces the total # of causes of death to 439.
Step 9: Deal with naming inconsistencies across years. I started by finding out how many categories there were per year.
Step 10: Discover only two different names being used for unknown( I think – which we’ll get to in step 11)
Before:
After: (Shows columns updated appropriately)
Step 11: Establish a public forum where any medically astute truth-seekers can take a look at somehow standardizing….click here