Data + Methodology

Information on the data received, and the project's methodology.

Project data

The data for our project was obtained through Dr. Nadine Schuurman, courtesy of Dr. Louise Masse. We received 94 Excel datasets, one for each of the 94 study participants, all of whom are children. The data columns included timestamps, longitude and latitude coordinates, accelerometer activity counts and intensity levels, GPS speed, reported trip mode, and activity description. Because the longitude and latitude coordinates are sensitive information, we had to comply with an ethics requirement: to ensure data protection, we could only access, use, and analyze the data on campus at Simon Fraser University.

Methods

The methodology for our project can be split into three sections. The first is the methodology used for activity prediction based on accelerometer and GPS data (Figure 1). The second is the methodology used to validate the accuracy of the activity predictions we make (Figure 2). The third is the methodology used to create an anonymized map showcasing the locational movements of study participants within the Metro Vancouver region (Figure 3).

Methodology used for activity prediction

Figure 1: Methodology used for activity prediction.

For activity prediction (Figure 1), we started with the 94 Excel datasets received via Dr. Nadine Schuurman but removed two of them because their file names suggested bad accuracy data. The two files we removed were named "CSV_1005_Final_(Day7GPS-BadAccData)" and "CSV_2031_FINAL_(BadACC)."

We then proceeded to merge the 92 Excel datasets into six Excel datasets for ease of use. To ensure the Excel files were manageable, especially with regard to the amount of time it takes to load a dataset, we deleted data columns that we deemed irrelevant to our project.

At the core of the method, we created an Excel formula that classified each fifteen-second interval entry into one of four activity predictions based on accelerometer and GPS data. The four activity predictions are "Activity In Constrained Area," "Sedentary Transportation," "Active Transportation," and "Stationary." The data we were given came with two columns of accelerometer data. One records the actual activity counts, whereas the other classifies the activity counts into intensity levels ranging from 0 for non-activity to 3 for intense activity. For simplicity, we used the classified "ActivityIntensity" column as the accelerometer input in our Excel formula. We used the "Speed" column, which measures the change in a study participant's location over time, as the GPS input.

If the accelerometer recorded an activity intensity level greater than 0 while the GPS speed was less than 1, we deemed it to be "Activity In Constrained Area." It is a fair assumption that a person who is active within a constrained area will show an activity intensity level of at least 1 but will not be moving fast spatially. If the accelerometer recorded an activity intensity level of 0 while the GPS speed was greater than 10, we deemed it to be "Sedentary Transportation." It is a fair assumption that a person using a sedentary form of transportation such as a car, bus, or rapid transit will not show activity but will be moving fast spatially. If the accelerometer recorded an activity intensity level greater than or equal to 1 while the GPS speed was greater than or equal to 1, we deemed it to be "Active Transportation." It is a fair assumption that a person using an active form of transportation such as walking, running, cycling, or scootering will show an activity intensity level of at least 1 and will be moving spatially, though more slowly than a vehicle. All other values were deemed to be "Stationary." Given the conditions above, these are the entries with an activity intensity level of 0 and a GPS speed of 10 or less. It is a fair assumption that a stationary person will not be showing activity and will be moving little, if at all, spatially. Below is what our Excel formula looked like after incorporating the above statements:

=IF(AND(I2<1,E2>0),"Activity In Constrained Area",IF(AND(I2>10,E2=0),"Sedentary Transportation",IF(AND(I2>=1,E2>=1),"Active Transportation","Stationary")))
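For readers who prefer to see the rule outside of a spreadsheet, the same logic can be sketched in Python. This is only an illustration and was not part of the Excel workflow described here. The function name predict_activity is hypothetical, and the sketch assumes that column I holds "Speed" and column E holds "ActivityIntensity," which is what the conditions described above imply.

```python
def predict_activity(speed, intensity):
    """Mirror the nested IF formula, checking the conditions in the same order.

    speed     -- GPS speed (assumed to be column I in the formula)
    intensity -- accelerometer intensity level 0 to 3 (assumed to be column E)
    """
    if speed < 1 and intensity > 0:
        return "Activity In Constrained Area"
    if speed > 10 and intensity == 0:
        return "Sedentary Transportation"
    if speed >= 1 and intensity >= 1:
        return "Active Transportation"
    # Fallback branch: with integer intensity levels of 0 to 3, this is
    # an intensity of 0 combined with a speed of 10 or less.
    return "Stationary"


if __name__ == "__main__":
    # Made-up example values, not taken from the study data.
    for speed, intensity in [(0.5, 2), (25, 0), (4, 1), (5, 0)]:
        print(speed, intensity, predict_activity(speed, intensity))
```

Writing the fallback as an explicit last step makes it easier to see that an entry with no recorded activity and a speed of, say, 5 is labelled "Stationary" rather than a form of transportation.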

Methodology used to validate accuracy of activity predictions

Figure 2: Methodology used to validate accuracy of activity predictions.

To validate the accuracy of the activity predictions (Figure 2), we first created a formula that classified the "TripMode" column into the same activity predictions as our "Predicted" column. Many entries in this column had a value of "Unknown." Therefore, the reclassified "Reported" column includes the four activity predictions as well as a value for "Unknown."

The "TripMode" column was chosen to validate the accuracy of our activity prediction because it did not have too many unique values. As such, we could easily attribute values present in the column to one of the four activity predictions. For example, values such as "vehicle" and "public transit" were attributed to "Sedentary Transportation" and values such as "scooter" and "pedestrian" were attributed to "Active Transportation." Below is the Excel formula we utilized to reclassify the "TripMode" column:

=IF(AND(J2="unknown"),"Unknown",IF(AND(J2="bicycle"),"Active Transportation",IF(AND(J2="vehicle"),"Sedentary Transportation",IF(AND(J2="stationary"),"Stationary",IF(AND(J2="public transit"),"Sedentary Transportation",IF(AND(J2="scooter"),"Active Transportation",IF(AND(J2="pedestrian"),"Active Transportation","Failed")))))))
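The reclassification amounts to a lookup table, which can be sketched in Python as follows. This is an illustrative sketch rather than the method used in the project; the names TRIP_MODE_TO_ACTIVITY and reclassify_trip_mode are hypothetical. The input is lowercased before the lookup because Excel's = comparison on text ignores case.

```python
# Lookup table built from the values that appear in the Excel formula.
TRIP_MODE_TO_ACTIVITY = {
    "unknown": "Unknown",
    "bicycle": "Active Transportation",
    "vehicle": "Sedentary Transportation",
    "stationary": "Stationary",
    "public transit": "Sedentary Transportation",
    "scooter": "Active Transportation",
    "pedestrian": "Active Transportation",
}


def reclassify_trip_mode(trip_mode):
    """Return the activity label for a reported trip mode.

    Any value not listed in the table yields "Failed", as in the formula.
    """
    return TRIP_MODE_TO_ACTIVITY.get(trip_mode.lower(), "Failed")


if __name__ == "__main__":
    for mode in ["vehicle", "Pedestrian", "skateboard"]:
        print(mode, "->", reclassify_trip_mode(mode))
```

In the formula as written, none of the listed trip modes is mapped to "Activity In Constrained Area," which is worth keeping in mind when reading the accuracy results.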

With the "TripMode" reclassified into the "Reported" column, we created yet another Excel function comparing the "Predicted" column against the "Reported" column to determine accuracy. If the values for both columns were the same for a particular entry, the "Accuracy" column would have a value of "Accurate." If not, it would have a value of "Inaccurate." However, if the "Reported" column had a value of "Unknown," the "Accuracy" column would have a value of "Unknown." Below is the Excel formula we utilized to compare the "Predicted" and "Reported" data columns:

=IF(AND(Q2="Unknown"), "Unknown",IF(AND(P2=Q2), "Accurate", "Inaccurate"))
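The comparison can be expressed as a short Python function. This is a sketch for illustration only, assuming that column P holds "Predicted" and column Q holds "Reported," as the description above implies; the function name judge_accuracy is hypothetical.

```python
def judge_accuracy(predicted, reported):
    """Compare a predicted label with a reported label.

    An "Unknown" report cannot confirm or contradict the prediction,
    so it is checked first and passed through unchanged.
    """
    if reported == "Unknown":
        return "Unknown"
    return "Accurate" if predicted == reported else "Inaccurate"


if __name__ == "__main__":
    # Made-up label pairs, not taken from the study data.
    print(judge_accuracy("Stationary", "Stationary"))
    print(judge_accuracy("Active Transportation", "Sedentary Transportation"))
    print(judge_accuracy("Stationary", "Unknown"))
```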

Using the resulting values in the "Accuracy" column, we created two graphs with Excel PivotTables to show summary statistics. The first is a bar graph showing the counts of the values "Accurate," "Inaccurate," and "Unknown." The second is a pie chart showing the percentage of "Accurate" and "Inaccurate" values out of the total for those two values only. We excluded "Unknown" from the pie chart because an "Unknown" value in the "Accuracy" column cannot be used for or against the accuracy of our activity predictions.
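The arithmetic behind the two graphs can be sketched in Python. This is an illustration under stated assumptions, not the PivotTable workflow itself: the function name summarize is hypothetical and the sample values are made up rather than taken from the study data.

```python
from collections import Counter


def summarize(accuracy_values):
    """Return (counts, shares) for a list of "Accuracy" column values.

    counts -- number of occurrences of each value (bar graph)
    shares -- percentage of "Accurate" and "Inaccurate" out of the
              entries that are one of those two values (pie chart)
    """
    counts = Counter(accuracy_values)
    judged = counts["Accurate"] + counts["Inaccurate"]
    if judged == 0:
        return counts, {}
    shares = {
        label: 100 * counts[label] / judged
        for label in ("Accurate", "Inaccurate")
    }
    return counts, shares


if __name__ == "__main__":
    sample = ["Accurate", "Inaccurate", "Unknown", "Accurate", "Accurate"]
    counts, shares = summarize(sample)
    print(dict(counts))
    print(shares)
```

Dividing by the number of judged entries, rather than by all entries, is what keeps the "Unknown" values from diluting either percentage.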

Methodology used to create an anonymized map of locational movements

Figure 3: Methodology used to create anonymized map of locational movements.

To create an anonymized map showcasing the locational movements of study participants within the Metro Vancouver region (Figure 3), we first converted the six Excel datasets into the .csv format so they could easily be imported into ArcMap. The data was then displayed geographically in ArcMap using the latitude and longitude coordinates. The datasets were subsequently merged into a single shapefile and stored in a personal geodatabase to ensure faster processing.

Since some of the locational movement points were located outside of the Metro Vancouver region, we used the "Select by Location" tool in ArcMap to select the point features that were within the boundary of the Metro Vancouver region. We then exported these selected point features to a new shapefile.
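The idea of keeping only the points that fall inside a boundary can be sketched in Python with a ray casting test. This is not how ArcMap implements "Select by Location," and it was not part of our workflow; it is a stand-in technique shown only to illustrate the concept. The function names are hypothetical, the square boundary and sample points are made up, and the sketch assumes planar coordinates and a single boundary polygon without holes.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting test: cast a ray from (x, y) to the right and count
    how many polygon edges it crosses. An odd count means inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def select_within(points, polygon):
    """Keep only the points that fall inside the polygon."""
    return [p for p in points if point_in_polygon(p[0], p[1], polygon)]


if __name__ == "__main__":
    boundary = [(0, 0), (4, 0), (4, 4), (0, 4)]  # made-up square boundary
    points = [(1, 1), (5, 1), (2, 3), (-1, 2)]   # made-up points
    print(select_within(points, boundary))
```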

With regard to data anonymization, we decided to display every point of the aggregated data within the boundary of the Metro Vancouver region so that the movement trajectories of individual children would largely get lost in the crowd of points. Furthermore, we decided to display the points alongside the boundary of the Metro Vancouver region only, without any other basemap. This simple step makes it difficult for a map reader to work out the exact location of a study participant. Last but not least, we decided to rasterize the points into a grid of large raster squares so that the locational movements of study participants would be displaced. The map reader will therefore be uncertain about exactly where within any large raster square a study participant was located.
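The effect of rasterizing can be sketched in Python by assigning each point to a grid cell and keeping only the cell. This is an illustration of the idea rather than the ArcMap tool we used. The function name rasterize, the cell size, the grid origin, and the sample points are all hypothetical, and storing a point count per cell is an assumption made for the sketch.

```python
import math
from collections import Counter


def rasterize(points, cell_size, origin=(0.0, 0.0)):
    """Count points per grid cell, identified by (column, row).

    The exact coordinates are discarded; only the cell that contains
    each point is kept, which is what hides the precise location.
    """
    counts = Counter()
    for x, y in points:
        col = math.floor((x - origin[0]) / cell_size)
        row = math.floor((y - origin[1]) / cell_size)
        counts[(col, row)] += 1
    return counts


if __name__ == "__main__":
    # Made-up planar coordinates, not taken from the study data.
    points = [(0.2, 0.3), (0.9, 0.1), (1.5, 0.5), (-0.1, 2.2)]
    print(dict(rasterize(points, cell_size=1.0)))
```

The larger the cell size, the more positional detail is lost, which is the trade-off between anonymity and map detail.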