
# Background

This project has two main focuses: data exploration and data analysis. Natural Language Processing is rooted in computer science, where large amounts of data can be processed efficiently and accurately using code (e.g., Python, C++). Given the large volume of data requiring processing and this project's use of the geospatial characteristics of Twitter data (coordinates provided with each tweet), the following literature review evaluates the two academic areas that make up the project's multidisciplinary approach: Computational Linguistics and GIScience. Reviewing research within Computational Linguistics is necessary not only to better understand existing methods for exploring large datasets, but also to recognize the limitations of those methods. Extracting lexical information from a large number of tweets can be enriched through spatial analysis of the data, and GIS holds great potential for such enrichment.

## GIS 2.0

The intersection of social media and GIS has spurred new developments that expand the field of GIS. Examples of this expansion are Volunteered GIS (VGIS) and Public Participation GIS (PPGIS), both of which can be categorized under the title of GIS 2.0 (Elwood, 2009; Haklay, Singleton, & Parker, 2008). GIS 2.0 is a name given to new developments whose defining characteristics are user-generated content and interactivity. It differs from the traditional paradigm in which the state builds and holds access rights to the databases and maps of its GIS. GIS 2.0 is a democratization of data, with people functioning as the sensors and the indicators of social trends and phenomena. These modern factors of democratization and social media allow our project to exist, in contrast to the state manually surveying individuals' eating habits and then displaying them spatially. With the innovations of GIS 2.0, our data is perpetually produced, intrinsically dynamic, and easy to monitor. Our project aims to i) identify common words and traits within the plethora of tweets within Vancouver to discern recurrent feelings towards healthy and unhealthy eating habits, ii) spatially display the tweets to recognize patterns in relation to neighbourhoods in Vancouver, including a further step of overlaying transit maps to correlate these eating habits with transportation corridors, and iii) analyze these outcomes to draw conclusions about the processes and ontologies of the foodscape of Vancouver.

## Big Data + GIS

In 'Big Data and Cloud Computing: Innovation Opportunities and Challenges', Yang et al. (2017) use the term 'Big Data' to refer to "the flood of digital data from many digital earth sources, including sensors, …, e-mails and social networks". Such data can be directly or indirectly related to geospatial information. Twitter data, with its numerous contributors and geotagged entries (tweets), is therefore Big Data by definition. The processes outlined for Big Data (data processing, data mining, and knowledge discovery methods) provide key steps applicable to Twitter data, steps that can also be found in the article "A data-driven framework for archiving and exploring social media data" by Huang & Xu (2014). That article presents a general strategy and develops a conceptual model for querying and processing massive datasets. Huang & Xu (2014) emphasize the importance of social media data due to its potential for detecting events and indicating societal situations. The term 'citizens-as-sensors' is used to express the idea that human actors in a connected environment, when augmented with ubiquitous mechanical sensory systems, can form the most intelligent sensory web (Huang & Xu, 2014). Using Twitter as an example, three collaborative strategies for forming such a sensory web are as follows: datasets are archived as different collections in the database (DB); parallel computing is applied to harvest, query, and analyze tweets to or from different collections simultaneously; and data is duplicated across multiple servers to support massive concurrent access to the datasets (Huang & Xu, 2014). A brief sketch of this archive-and-query pattern follows at the end of this section. In addition to statistical analysis, machine learning, and data mining, Yang et al.'s (2017) article provides insights into GIS technology. Du et al. (2016) introduced an interactive visual approach to detect spatial clusters in emerging spatial Big Data (e.g., geo-referenced social media data) using dynamic density volume visualization in a 3D space (two spatial dimensions and a third temporal or thematic dimension of interest) (as cited in Yang et al., 2017). Lary et al. (2014) presented a holistic system, called Holistics 3.0, that combines multiple big datasets and massively multivariate computational techniques, such as machine learning, coupled with geospatial techniques, for public health (as cited in Yang et al., 2017). To extract meaningful information from massive data, much more effort should be devoted to developing comprehensive libraries and tools that are easy to use and capable of mining massive, multidimensional data. This corresponds to Sui and Goodchild's (2011) article, "The Convergence of GIS and Social Media: Challenges for GIScience", which focuses on the shortcomings of GIS in terms of spatial analysis of social media data. Another example exploring the union of GIS and social media is Yamada and Yamamoto's "Development of Social Media GIS for Information Exchange between Regions" (2013). Looking at a rural, car-oriented area, the eastern part of Yamanashi Prefecture, the goal of this study is to establish a GIS using Twitter to build a legible landscape in terms of amenities and attractions. This GIS would be a combination of Web-GIS, Social Networking Sites (SNS), and tweets (Yamada and Yamamoto, 2013). To elaborate, Web-GIS is an accessible, user-generated version of GIS that allows the public autonomy over map features.
Off-the-beaten-path areas such as this study area contain a large amount of "implicit knowledge" which Yamada and Yamamoto seek to make "explicit". Many societal factors are active drivers of this endeavour, specifically Japan's movement towards an "information-oriented society" and improvements in bandwidth that allow for more internet usage, with the ultimate result being greater convenience in using Web-GIS. Yamada and Yamamoto developed a system with high portability and ease of contribution from outside the region, hence the inclusion of Twitter, alongside the SNS, for the longevity of the system. Users in the system identify their demographics, such as age, sex, and region of origin, so that user trends can be understood, and provide a user ID and greeting to promote dialogue between users (Yamada and Yamamoto, 2013).
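
To make the archive-and-parallel-query strategy of Huang & Xu (2014) concrete, the following is a minimal sketch, not the authors' implementation. The collection names (`tweets_healthy`, `tweets_unhealthy`), the table schema, and the keyword filter are illustrative assumptions; a production system would use a distributed document database and a cluster rather than a local SQLite file and local processes.

```python
"""Sketch of the archive-and-parallel-query pattern: tweets are archived as
separate collections (here, SQLite tables) and queried in parallel.
Collection names, schema, and keywords are assumptions for illustration."""

import sqlite3
from concurrent.futures import ProcessPoolExecutor

DB_PATH = "tweets.db"  # hypothetical local archive

def archive(collection, tweets):
    """Store one collection of tweets as its own table."""
    con = sqlite3.connect(DB_PATH)
    con.execute(
        f"CREATE TABLE IF NOT EXISTS {collection} "
        "(id TEXT PRIMARY KEY, text TEXT, lon REAL, lat REAL)"
    )
    con.executemany(
        f"INSERT OR IGNORE INTO {collection} VALUES (?, ?, ?, ?)", tweets
    )
    con.commit()
    con.close()

def count_keyword(args):
    """Count tweets in one collection whose text mentions a keyword."""
    collection, keyword = args
    con = sqlite3.connect(DB_PATH)
    (n,) = con.execute(
        f"SELECT COUNT(*) FROM {collection} WHERE text LIKE ?",
        (f"%{keyword}%",),
    ).fetchone()
    con.close()
    return collection, n

if __name__ == "__main__":
    # Two illustrative collections; real data would come from the Twitter API.
    archive("tweets_healthy", [("1", "salad at lunch", -123.12, 49.28)])
    archive("tweets_unhealthy", [("2", "late night burger", -123.10, 49.26)])

    # Query the collections simultaneously, one worker per collection.
    jobs = [("tweets_healthy", "salad"), ("tweets_unhealthy", "burger")]
    with ProcessPoolExecutor() as pool:
        for collection, n in pool.map(count_keyword, jobs):
            print(collection, n)
```

The key design choice mirrored here is that each collection is queried by its own worker, so harvesting and analysis can proceed over many collections simultaneously.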

## Social Media and Obesity

Similar research has been done on the topic of social media and obesity. One example is an examination of obesity using Yahoo Answers. This study analyzed individuals who posted questions about body weight ("askers") and examined those askers who self-reported anthropometric data while asking a weight-related question. The body weight status of askers was then classified into four categories according to the National Institutes of Health guidelines for the classification of overweight and obesity in adults (Kuebler, Yom-Tov, Pelleg, Puhl, & Muennig, 2013). The research team used term-matching to extract questions about the asker's weight, automatically extracting all questions that contained the phrase "Am I <term>?" in their title or description, with <term> being one of "skinny", "thin", "fat", or "obese" (Kuebler et al., 2013). The main limitation of this study is that the automatic measurement extractor is tuned for precision rather than recall. This means, for example, that a sentence like "I weigh 128" will not trigger the rule because of the missing suffix "lbs" (Kuebler et al., 2013). As such, the extractor is more likely to fail to report a measurement than to report one erroneously. To validate the automatic extractor, 200 randomly selected questions were inspected manually and compared with the extractor's results. Another example deals directly with geotagged Twitter data to examine neighborhood happiness, diet, and physical activity. A research team processed tweets to create variables that measured sentiment, food, and physical activity. To accomplish this, they used a bag-of-words algorithm, which creates a simplified representation of tweets that disregards grammar and word order but tracks the frequency of terms or components of tweets; computations were then performed on those components and terms (Nguyen et al., 2016). The first step was to divide each tweet into tokens. After the tokens were obtained, the team looked up each word in a word dictionary to get its corresponding happiness score for sentiment analysis. Features such as capitalized words, exclamation marks, smiley faces, and other properties added or subtracted marks from the overall happiness score (Nguyen et al., 2016).
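
The two processing steps described above can be illustrated with a minimal sketch: (i) a precision-oriented match of the "Am I <term>?" form, and (ii) bag-of-words tokenization with a dictionary-based happiness score. The regular expression, the toy happiness dictionary, and the bonuses for exclamation marks and smiley faces are assumptions for illustration only; neither study's actual lexicon or weights is reproduced here.

```python
"""Sketch of precision-oriented term matching ("Am I <term>?") and a
bag-of-words happiness score. The regex, the toy dictionary, and the
feature bonuses are illustrative assumptions, not the cited studies'."""

import re
from collections import Counter

WEIGHT_TERMS = ("skinny", "thin", "fat", "obese")
ASK_PATTERN = re.compile(
    r"\bam i (%s)\b\s*\?" % "|".join(WEIGHT_TERMS), re.IGNORECASE
)

# Toy happiness dictionary (assumed values, not a published lexicon).
HAPPINESS = {"love": 8.0, "salad": 6.5, "burger": 5.5, "hate": 2.5, "tired": 3.0}

def is_weight_question(text):
    """Precision-oriented rule: match only the exact 'Am I <term>?' form."""
    return ASK_PATTERN.search(text) is not None

def happiness_score(tweet):
    """Tokenize a tweet (bag of words), average the dictionary scores of
    known words, and add small assumed bonuses for '!' and ':)'."""
    counts = Counter(re.findall(r"[a-z']+", tweet.lower()))
    known = {t: n for t, n in counts.items() if t in HAPPINESS}
    base = sum(HAPPINESS[t] * n for t, n in known.items()) / max(1, sum(known.values()))
    bonus = 0.2 * tweet.count("!") + 0.3 * tweet.count(":)")
    return base + bonus

if __name__ == "__main__":
    print(is_weight_question("Am I fat?"))     # True
    print(is_weight_question("I weigh 128"))    # False: the rule favours precision
    print(round(happiness_score("Love this salad!! :)"), 2))
```

As in the cited studies, the rule-based matcher errs on the side of missing cases rather than producing false positives, while the bag-of-words score ignores word order entirely and relies only on term frequencies plus simple surface features.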

In Computational Linguistics, the tools and skills for data extraction are extensive, and computing scientists have developed numerous algorithms for different purposes. However, data types and formats have evolved dramatically over time, and with the addition of spatial information, methods for exploring and analyzing social media data remain immature. Cloud computing, parallel computing, SQL querying, and many other existing methods can be used for extracting information from large datasets; how to incorporate spatial, and even temporal, information into these methods is the question that arises from this literature review. In this project, we will assess language processing methods through the geospatial exploration of health trends and identify potential limitations in the overall process.