Types of raw datasets are called tables. There are 9 tables from the raw data:
Types of clean datasets are called panels. There are 4 panels in the clean data:
The original data source is downloaded and shared with the team at a point in time. We call each round of data extraction a
We started by going over all the files of type
.xml in the folder
Google Drive/MinnCCAccess/Analysis/Data/Raw. We identified the files that are from the Nware database. Then we classified them into two groups: supersets and subsets. A file is called superset if it has unique observations or columns that no other file in the folder has. A file is called subset if all the information in the file can be found in another superset file. If two or more files have the exact same information, only one of them is classified as superset, and all the others are labelled as subsets.
We went over every file by opening the files in Excel and checking their content. We wrote a log summarizing our findings on the contents of the file and their superset/subset status.
Then, we created a crosswalk table that contains information on the supersets of each file. Remember that a superset is a superset of itself. Then, we wrote a script that checks whether a subset is really a subset of its superset.
We also checked whether the crosswalk table contains all the Nware files in the raw folder.
We checked every file to determine what type of a table that file contains. We used this information to rename the files.
We renamed a copy of each superset and moved them to a new folder,
Data Cabinet/Nware/Step 1 Data. The naming convention is as follows. The first part of the name contains information on the type of data that the file contains. The types are
rate. The second part of the name is the ISO 8601-formatted date of the snapshot.
Then, we created a crosswalk table that contains the old and the new paths of the files. Then, we wrote a script that checks whether the old and the new files are actually the same. We manually checked the file names, by checking whether the file name actually represents the correct table type and the snapshot date.
In this step, we do two things. First, we had multiple periods of time for each table, since there were many snapshots. We should have merged the tables so that we have the data for each table for all dates in a single dataframe. Second, some variables changed tables over time. For example, a variable in
general table before 2017 started to appear in
provider_name table starting 2017. In addition, there were multiple tables with the same unit of observation. We decided to combine all the tables with the unit of observation into a single panel. We also change variable names in this step to camel case.
Renaming the variables before merging the files allowed us to avoid having multiple columns for the same variable (with a different name across snapshots).
The final versions of the panels include all the snapshots from various dates.
An important thing to note here is that May 2012 snapshot of General table has problematic rows. We dropped these rows before appending this file.
There are four different units of observations we found in the original tables.
provider_name are in the first group.
shift is in the second group.
rates are in the third group. Finally,
provider_schedule is in the fourth group. This means that we can actually have 4 tables instead of 9, by merging the tables with the same units of observations by using the correct primary/foreign keys.
It is important to note that
general also contains group 2, group 3 and group 4 -type data before 2017 in a very messy way. We will talk about how we take those variables and move them to their corresponding tables in the following sections.