
Methods

The raw data consisted of 690,000 randomly collected geotagged tweets across the City of Vancouver, obtained through the Twitter API. The dataset arrived with no organization or inherent pattern. Our aim was to use these tweets to explore the eating habits and potential obesity risks of people in the City of Vancouver.

LEXICON
Tools Used: WordNet, Twitter, Google Knowledge Graph

The first and most important step in analyzing this large dataset was to create a keyword-search lexicon. With this lexicon we could then extract, or “massage”, the raw data into a refined, categorized dataset. Synonyms were our first consideration. We listed a range of words associated with fat, emotion and skinny, and used the online tool WordNet to look for related terms. WordNet is a large lexical database of English that groups words into sets of cognitive synonyms based on their meanings (WordNet, 2015). It covers both formal and informal terms, which is particularly useful when analyzing Twitter data. These sets of synonyms, called synsets, are linked to other synsets by a small number of “conceptual relations” (WordNet, 2015), allowing an even larger relevant lexicon to be produced in a short period of time.
Figure 1. WordNet search example
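As a rough illustration of this step, the sketch below expands a small seed list into a larger set of candidate keywords using WordNet synsets through the NLTK interface. The seed words and the use of NLTK are our own assumptions; the lexicon itself was built by searching the WordNet web tool directly.

```python
# Minimal sketch: expanding seed words with WordNet synsets via NLTK.
# Assumes NLTK and its WordNet corpus are installed (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

seed_words = ["fat", "skinny", "sad"]  # illustrative seeds; the real list was larger

expanded = set()
for word in seed_words:
    for synset in wn.synsets(word):        # each synset is one sense of the word
        for lemma in synset.lemma_names():  # synonyms within that sense
            expanded.add(lemma.replace("_", " ").lower())

print(sorted(expanded))  # candidate terms to review before adding to the lexicon
```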

In addition to the emotion- and body-image-based synsets, the next step was to look for food-related words. We used the Google Knowledge Graph to search for unhealthy foods (e.g. Coke, ice cream, foods high in sugar), healthy foods (e.g. kale, cabbage, other types of vegetables) and names of fast-food restaurants (e.g. KFC, McDonald's). This seeded the lexicon with key words, which were then searched on Twitter to check whether tweets containing them were actually indicative of consumption patterns. Any word that did not qualify was eliminated from the database, and additional terms not captured by the previous methods were also discovered during this validation process.
Figure 2. Search keyword lexicon created in Excel (Part 1)
Figure 3. Search keyword lexicon created in Excel (Part 2)
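For illustration, the sketch below shows one way the finished lexicon could be organized by the categories described in the next section. The specific words and the dictionary layout are our own assumptions based on Figures 2 and 3, not the authors' exact spreadsheet.

```python
# Hedged sketch: a category-organized keyword lexicon (illustrative words only).
# The real lexicon, built in Excel and shown in Figures 2 and 3, was much larger.
lexicon = {
    "healthy": ["kale", "cabbage", "salad", "quinoa", "slim", "fit"],
    "unhealthy": ["coke", "ice cream", "fries", "kfc", "mcdonald", "fat", "obese"],
    "negative_emotion": ["sad", "depressed", "stressed", "lonely"],
}

# Flatten into a single keyword set for the filtering step that follows.
all_keywords = {word for words in lexicon.values() for word in words}
```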

WORD EXTRACTION
Tool Used: Excel
After creating a complete lexicon database, word extraction was the next big stage. We divided our data into three categories:

  1. Healthy tweets
  2. Unhealthy tweets
  3. Tweets that are associated with negative emotions

Under the healthy tweets category, types of healthy food and words describing a slim body were included, while the unhealthy tweets category included unhealthy foods, fast-food restaurant names and words related to obesity or fat. With these targeted keyword lists, a query in Excel was then used to find all messages that contained at least one of the keywords.
The formula =SUMPRODUCT(--ISNUMBER(SEARCH(things,A1)))>0, where things refers to the range holding the lexicon keywords, returned TRUE when any of the targeted words was found in a tweet and FALSE otherwise. All tweets with a value of TRUE were filtered and exported to ArcMap for further analysis.
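The same filter can be expressed outside Excel. The sketch below is a minimal Python equivalent of that formula, using made-up sample tweets and a shortened keyword list rather than the project's actual data.

```python
# Hedged sketch: a Python equivalent of the Excel keyword filter.
# Keywords and tweets are illustrative stand-ins for the real lexicon and dataset.
keywords = ["kale", "salad", "fries", "ice cream", "mcdonald"]

tweets = [
    "grabbing fries and a shake after work",
    "kale salad for lunch again",
    "beautiful sunset over english bay",
]

def contains_keyword(text, words):
    """Mirror of =SUMPRODUCT(--ISNUMBER(SEARCH(things,A1)))>0:
    True if any keyword appears anywhere in the tweet (case-insensitive)."""
    lowered = text.lower()
    return any(word in lowered for word in words)

matched = [t for t in tweets if contains_keyword(t, keywords)]
print(matched)  # the rows that would be kept and exported for mapping
```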
ANALYSIS + DATA VISUALIZATION
Tool Used: ArcMap 10.3.1
The data, which contained location information in coordinate form, was first exported in the WGS 1984 geographic coordinate system and then projected to NAD 1983 UTM Zone 10N using the ArcMap “Project” tool. The City of Vancouver boundary, divided into neighbourhoods and including the total population of each neighbourhood, was added alongside the point data. Two spatial analysis methods were performed: directional distribution and kernel density. Directional distribution analysis was conducted on the three categories to compare the trends of healthy eating, unhealthy eating and negative emotions. Kernel density analysis showed the clustering pattern of each type of tweet (healthy and unhealthy) and was performed with a 100 m search radius, with the output classified into 7 classes using natural breaks (Jenks).
Another important step in the analysis was to show the rate of Twitter users relative to the total population. The tweet points were spatially joined with the boundary shapefile, the number of tweets within each neighbourhood was counted, and the Field Calculator was then used in the attribute table to calculate the rate of Twitter usage.
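As a rough illustration of this workflow, the sketch below chains the corresponding geoprocessing tools with ArcPy. All file paths and field names (e.g. tweets_wgs84.shp, CATEGORY, TOTAL_POP) are hypothetical stand-ins, and the authors describe running these tools interactively in ArcMap 10.3.1 rather than through scripting.

```python
# Hedged sketch: the ArcMap workflow expressed with ArcPy (ArcGIS 10.x).
# Every path and field name here is a hypothetical placeholder.
import arcpy

arcpy.env.workspace = r"C:\project\data"   # assumed workspace
arcpy.CheckOutExtension("Spatial")          # kernel density needs Spatial Analyst

# 1. Project the WGS 1984 tweet points to NAD 1983 UTM Zone 10N (WKID 26910).
arcpy.Project_management("tweets_wgs84.shp", "tweets_utm10.shp",
                         arcpy.SpatialReference(26910))

# 2. Directional distribution (standard deviational ellipse), grouped by an
#    assumed CATEGORY field holding healthy / unhealthy / negative emotion.
arcpy.DirectionalDistribution_stats("tweets_utm10.shp", "tweet_ellipses.shp",
                                    "1_STANDARD_DEVIATION", "", "CATEGORY")

# 3. Kernel density of the tweet points with a 100 m search radius.
density = arcpy.sa.KernelDensity("tweets_utm10.shp", "NONE",
                                 cell_size=10, search_radius=100)
density.save("tweet_density.tif")

# 4. Count tweets per neighbourhood and derive a tweet rate per resident.
arcpy.SpatialJoin_analysis("neighbourhoods.shp", "tweets_utm10.shp",
                           "nbhd_tweets.shp", "JOIN_ONE_TO_ONE")
arcpy.AddField_management("nbhd_tweets.shp", "TWEETRATE", "DOUBLE")
arcpy.CalculateField_management("nbhd_tweets.shp", "TWEETRATE",
                                "!Join_Count! / float(!TOTAL_POP!)", "PYTHON_9.3")
```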
Multiple series of analyses were conducted in this project. Rather than writing a custom computer algorithm, we relied on an Excel formula and worked from there, shifting from data mining to GIS. The creation of a keyword lexicon, the method of extracting useful messages and the stages of spatial analysis are all part of our larger GIS analysis. The essential procedure throughout was data exploration, which holds for all GIS work, because GIS takes data and looks for patterns in it. If the data were problematic, everything built on them could fall apart; yet the real world is always more complex than the computer world. Because this analysis is closely tied to human linguistic convention, it is extremely hard for a computer to accurately and completely capture all of the words people use in the right context.