- Social Media and Public Health Research
- Python & Sublime Text
- What is an Application Programming Interface?
Day 2: JSON Data
Day 3: XML Data
Day 4: The Rest of the Pipeline
Day 5: Next Steps
--
Social media provide a window into trends in the general population in near real-time. They also provide a means for outreach and for assessing the effectiveness of interventions. Analyses of social media have provided insights into the dynamics of drug use [1], the response to natural disasters [2], the dynamics of foodborne illnesses [3,4,5], and the dynamics of infectious diseases [6].
The infrastructure created for social media was meant to share data reliably in the moment, not to support research. This has led to irreproducible results [7] and to inference of causation from implausible conceptual models [8]. Even so, digital epidemiology holds promise for increasing the reach and rapidity of traditional means of syndromic surveillance [9,10].
More plausible models and more robust findings could arise if public health researchers were involved earlier in the process of acquiring data from social media. This seminar aims to give public health researchers the tools to develop prototype data analysis pipelines for social media data.
Python is a programming language. Idiomatic Python reads close to English (style guide).
print("Hello World")
An application programming interface (API) refers to the protocols programs use to exchange data. An API allows a data source (e.g., Twitter) to serve data to a user. Those data become available to the user after logging in. Logging in allows the data source to track and regulate the distribution of its data.
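To make the idea of "exchanging data" concrete, the sketch below builds the kind of URL a program might request from a data source: an endpoint plus encoded parameters. Nothing is sent over the network. The endpoint shown is an assumption for illustration (the v1.1 Twitter search URL), and `build_request_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlencode

# Assumed endpoint, for illustration only; check the data source's
# current documentation before relying on it.
BASE_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_request_url(base_url, params):
    """Return the endpoint followed by URL-encoded query parameters."""
    return base_url + "?" + urlencode(params)


url = build_request_url(BASE_URL, {"q": "learn python", "count": 10, "lang": "en"})
print(url)
```

A wrapper library does this bookkeeping, plus the login step, on the user's behalf.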
Most APIs require two keys to log in. Roughly speaking, the first key identifies you and the second confirms your identity. (A deeper explanation)
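One way to keep the two keys out of the analysis code is to store them in a small JSON file. The sketch below is a minimal example under stated assumptions: the key names `CONSUMER_KEY` and `CONSUMER_SECRET` and the file name `twitter_credentials.json` match the wrapper example in this section, the values are placeholders, and `save_credentials` and `load_credentials` are hypothetical helpers.

```python
import json
import os
import tempfile

# Placeholder values; real keys are issued by the data source.
credentials = {
    "CONSUMER_KEY": "your-consumer-key",
    "CONSUMER_SECRET": "your-consumer-secret",
}


def save_credentials(creds, path):
    """Write the credentials dictionary to a JSON file."""
    with open(path, "w") as f:
        json.dump(creds, f)


def load_credentials(path):
    """Read the credentials dictionary back from a JSON file."""
    with open(path, "r") as f:
        return json.load(f)


# A temporary directory is used here so the sketch leaves no stray files.
path = os.path.join(tempfile.mkdtemp(), "twitter_credentials.json")
save_credentials(credentials, path)
loaded = load_credentials(path)
print(sorted(loaded))
```

In a real project the credentials file would live next to the analysis script and be excluded from version control.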
There are many Python wrappers to Twitter's API (overview).
from twython import Twython
import json

# Load credentials from a JSON file
with open("twitter_credentials.json", "r") as file:
    creds = json.load(file)

# Instantiate the API client with the two keys
api = Twython(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'])

# Create our query
query = {'q': 'learn python',
         'result_type': 'popular',
         'count': 10,
         'lang': 'en',
         }

# Run the search; ** unpacks the dictionary into keyword arguments
results = api.search(**query)
The code block above demonstrates how to access Twitter's API via the twython wrapper.
Twitter's API provides tweets as JSON objects (specification).
{
  "created_at": "Thu May 10 17:41:57 +0000 2018",
  "id_str": "994633657141813248",
  "text": "Just another Extended Tweet with more than 140 characters, generated as a documentation example, showing that [\"tru… https://t.co/U7Se4NM7Eu",
  "display_text_range": [0, 140],
  "truncated": true,
  "user": {
    "id_str": "944480690",
    "screen_name": "FloodSocial"
  },
  "extended_tweet": {
    "full_text": "Just another Extended Tweet with more than 140 characters, generated as a documentation example, showing that [\"truncated\": true] and the presence of an \"extended_tweet\" object with complete text and \"entities\" #documentation #parsingJSON #GeoTagged https://t.co/e9yhQTJSIA",
    "display_text_range": [0, 249],
    "entities": {
      "hashtags": [{
        "text": "documentation",
        "indices": [211, 225]
      }, {
        "text": "parsingJSON",
        "indices": [226, 238]
      }, {
        "text": "GeoTagged",
        "indices": [239, 249]
      }]
    }
  },
  "entities": {
    "hashtags": []
  }
}
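The example tweet shows a detail that is easy to miss: when `truncated` is true, the complete text and hashtags sit inside `extended_tweet`, while the top-level `text` and `entities` are incomplete. The sketch below parses a shortened version of that tweet with the standard `json` package. The shortened text and the helper names `full_text` and `hashtags` are assumptions made for illustration.

```python
import json

# A shortened stand-in for the tweet above; the text is abbreviated.
raw = """{
  "text": "Just another Extended Tweet... https://t.co/U7Se4NM7Eu",
  "truncated": true,
  "extended_tweet": {
    "full_text": "Just another Extended Tweet with more than 140 characters #documentation #parsingJSON",
    "entities": {"hashtags": [{"text": "documentation"}, {"text": "parsingJSON"}]}
  },
  "entities": {"hashtags": []}
}"""


def _complete_part(tweet):
    """Return the part of the tweet that holds the complete text and entities."""
    if tweet.get("truncated") and "extended_tweet" in tweet:
        return tweet["extended_tweet"]
    return tweet


def full_text(tweet):
    """Return the untruncated text when it is available."""
    part = _complete_part(tweet)
    return part.get("full_text", part.get("text", ""))


def hashtags(tweet):
    """Return the hashtag strings from the complete part of the tweet."""
    part = _complete_part(tweet)
    return [tag["text"] for tag in part.get("entities", {}).get("hashtags", [])]


tweet = json.loads(raw)
print(full_text(tweet))
print(hashtags(tweet))
```

Reading only the top-level `text` and `entities` fields would silently drop both hashtags in this example.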
Web site | Link to Python wrapper |
---|---|
SDK | |
Developer Library | |
YouTube | Developer Guide |
Twitter provides varying levels of access depending on how much one can pay. Completely free access provides a 1% random sample of streaming tweets. The sampling methodology is not clear. It is not known, for example, how long one must wait between samples for two samples to be independent.
Twitter's approach is not unique. Social media web sites (and their APIs) are meant to let other applications access current data. Sampling is provided to allow devices with low bandwidth to receive data, not to support statistical inference.
Geographic bias: The geographic information over-represents urban areas [11].
- Streaming versus Historical Data Twitter package
- The json package
- Storage (there's an API for that)
- Project management, including how to hand off a prototype
- Who still uses XML?
- Structure of XML data.
- BeautifulSoup and lxml
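As a rough illustration of the structure of XML data, the sketch below reads a small, invented document into a list of dictionaries. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup or lxml so that it runs without extra installs. The element and attribute names (`reports`, `report`, `symptom`, `id`, `city`) are hypothetical and do not come from any real feed.

```python
import xml.etree.ElementTree as ET

# Invented XML for illustration: elements nest, and attributes hold metadata.
xml_text = """
<reports>
  <report id="1" city="Chicago">
    <symptom>nausea</symptom>
    <symptom>fever</symptom>
  </report>
  <report id="2" city="Las Vegas">
    <symptom>nausea</symptom>
  </report>
</reports>
""".strip()

root = ET.fromstring(xml_text)

# Flatten each <report> element into a dictionary.
records = [
    {
        "id": report.get("id"),
        "city": report.get("city"),
        "symptoms": [s.text for s in report.findall("symptom")],
    }
    for report in root.findall("report")
]

for record in records:
    print(record)
```

Once flattened this way, XML records can be handled with the same tools as parsed JSON.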
import pandas as pd

# Search tweets (api and query are defined in the twython example above)
dict_ = {'user': [], 'date': [], 'text': [], 'favorite_count': []}
for status in api.search(**query)['statuses']:
    dict_['user'].append(status['user']['screen_name'])
    dict_['date'].append(status['created_at'])
    dict_['text'].append(status['text'])
    dict_['favorite_count'].append(status['favorite_count'])

# Structure data in a pandas DataFrame for easier manipulation
df = pd.DataFrame(dict_)
df.sort_values(by='favorite_count', inplace=True, ascending=False)
df.head(5)
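The `created_at` values collected above are plain strings, which cannot be sorted or grouped by time until they are parsed. The sketch below converts the timestamp from the example tweet into a timezone-aware `datetime` using only the standard library. The format string is an assumption inferred from that one example, and `parse_created_at` is a hypothetical helper. The `%a` and `%b` codes depend on an English locale.

```python
from datetime import datetime, timezone

# Format inferred from the example "Thu May 10 17:41:57 +0000 2018".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


stamp = parse_created_at("Thu May 10 17:41:57 +0000 2018")
print(stamp.isoformat())
```

The same format string can likely be passed to `pd.to_datetime` to convert a whole `date` column at once.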
- Graphing with seaborn and matplotlib
- What are binary files?
- What are MongoDB and MySQL?
- Study Design
- What Journals?
- What Conferences?
- What Grants?
- What Collaborators?
- How to open an issue on GitHub
- How to fork a repo
- Download the Desktop app for GitHub
- Install Python
- Install Sublime Text (Mac) -or- Install Notepad++ (PC)
Footnotes
1. Chary, M., Genes, N., Giraud-Carrier, C., Hanson, C., Nelson, L.S. and Manini, A.F., 2017. Epidemiology from tweets: estimating misuse of prescription opioids in the USA from social media. Journal of Medical Toxicology, 13(4), pp.278-286.
2. Murakami, A. and Nasukawa, T., 2012, April. Tweeting about the tsunami?: mining twitter for information on the tohoku earthquake and tsunami. In Proceedings of the 21st International Conference on World Wide Web (pp. 709-710). ACM.
3. Harris, J.K., Mansour, R., Choucair, B., Olson, J., Nissen, C. and Bhatt, J., 2014. Health department use of social media to identify foodborne illness—Chicago, Illinois, 2013–2014. MMWR. Morbidity and mortality weekly report, 63(32), p.681.
4. Kuehn, B.M., 2014. Agencies use social media to track foodborne illness. JAMA, 312(2), pp.117-118.
5. Sadilek, A., Kautz, H., DiPrete, L., Labus, B., Portman, E., Teitel, J. and Silenzio, V., 2016, March. Deploying nEmesis: Preventing foodborne illness by data mining social media. In Twenty-Eighth IAAI Conference.
6. Aramaki, E., Maskawa, S. and Morita, M., 2011, July. Twitter catches the flu: detecting influenza epidemics using Twitter. In Proceedings of the conference on empirical methods in natural language processing (pp. 1568-1576). Association for Computational Linguistics.
7. Butler, D., 2013. When Google got flu wrong. Nature News, 494(7436), p.155.
8. Rowland, K., 2012. Epidemiologists put social media in the spotlight. Nature.
9. Salathe, M., Bengtsson, L., Bodnar, T.J., Brewer, D.D., Brownstein, J.S., Buckee, C., Campbell, E.M., Cattuto, C., Khandelwal, S., Mabry, P.L. and Vespignani, A., 2012. Digital epidemiology. PLoS computational biology, 8(7), p.e1002616.
10. Fung, I.C.H., Tse, Z.T.H. and Fu, K.W., 2015. The use of social media in public health surveillance. Western Pacific surveillance and response journal: WPSAR, 6(2), p.3.
11. Hecht, B. and Stephens, M., 2014, May. A tale of cities: Urban biases in volunteered geographic information. In Eighth International AAAI Conference on Weblogs and Social Media.