Sociology 290: Theories/Practice of Big Data [for Social Scientists]

Fall 2016
Wednesdays 12:00-14:50
SSB 414

Juan Pablo Pardo-Guerra


What can social scientists do with ‘big data’? And, how does ‘big data’ alter the qualities of our craft? Organized as a workshop, this course explores these and related questions by examining the challenges, limits and possibilities of so-called big data for contemporary social scientists. In doing so, the course aims to develop theoretical attentiveness, methodological awareness and practical experience in handling, analyzing and presenting results derived from using ‘big data’. The course involves developing three group projects.

This year, the course will tackle the question of whether algorithms (and, by consequence, big data) are performative. Do algorithms change their publics through data? If so, do they reproduce the logics of their embedding organizations? (There is much chatter about, for instance, the self-referential feedback of news on Facebook walls, or the homophily of Twitter networks. Can we find further evidence of such algorithmically configured echo chambers?)


There is no right way of doing ‘big data’ in the social sciences—there are, at best, sensibilities about what works well and under what conditions. Developing these sensibilities requires patience, collaborative work, and a continuous exchange of ideas: as a student of the course, then, you are expected to attend each meeting, do the readings in advance, prepare questions for each session, and work with your group after class. I also don’t presume that this course will make you a coding wizard or an exceptional ‘big data’ theorist—my more modest ambition is to provide you with the tools that you need in order to become a resourceful bricoleur who knows how to use off-the-shelf and built-from-scratch instruments to explore social life.


This course is assessed through a combination of participation, group projects, and individual reports. The assessments and their relative weights are as follows:

  1. Three project reports, due on weeks 4, 7 and 10. Project reports should be 2,000 to 3,000 words in length, and must be written like a short research article (in the style, for instance, of the Proceedings of the National Academy of Sciences). This involves introducing a research question, surveying the relevant literature, describing data and methods, presenting results, and offering a conclusion. Each report will contribute 15% towards the final grade. (45%)
  2. Weekly progress reports, due on Fridays at 6:00 p.m. in weeks 2, 3, 5, 6, 8 and 9. These brief reports (300-500 words in length) should outline the progress of the group’s projects as well as plans for the following week. Each weekly progress report must specify a question on which the group would like to receive feedback, as well as a practical topic for discussion in class (e.g. “How do I produce topic models?” or “How do I scrape a website?”). The six equally weighted reports will constitute 15% of the final grade. (15%)
  3. Personal report: a 2,000-2,500 word personal report, due on the Friday of week 10 and reflecting on what makes big data distinct in the social sciences, will represent an additional 20% of the final grade.
  4. Presentations: groups will be asked to present the findings of their projects in weeks 4, 7 and 10. Presentations will be assessed at 20% of the final grade.
  5. Bonus: after the group presentations in weeks 4, 7 and 10, there will be a vote for the best project of the unit. A sealed (though not anonymous) ballot will determine a winner. This will produce a ranking for the overall course. The top groups in the ranking will be rewarded with an additional 5 points on their grades. (Specific rules will apply to avoid strategic voting across groups.)


Note: This is a student-driven course. While weeks 2, 5 and 8 introduce the broad conceptual and methodological issues related to each of the three units (and therefore have a relatively stable reading list), the contents for weeks 3, 6 and 9 will depend on student interest and the specific methodological challenges of your projects. Contents may include: programming languages (Python and R), the use of off-the-shelf software (ConText, CitNetExplorer, Gephi, Cytoscape), data scraping, natural language processing (ngrams, sentiment analysis, topic modeling), machine learning, data visualization, M-Turk surveys, and other relevant themes.

(* indicates required readings)

WEEK 1 – Small data, big data, no data

What is big data, and what makes it so significant? In this week, we will start our exploration by thinking about the production, consumption and politics of so-called big data. We will also walk through some practical issues (knowledge of programming, or lack thereof, as well as course organization).

*boyd, danah and Kate Crawford (2012) “Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon” Information, Communication & Society 15(5) DOI: 10.1080/1369118X.2012.678878

*Kitchin, Rob (2014) “Big Data, new epistemologies and paradigm shifts” Big Data & Society 1 (1) DOI: 10.1177/2053951714528481

*Anderson, Chris (2008) “The end of theory: the data deluge makes the scientific method obsolete” Wired June 23

*Pigliucci, Massimo (2009) “The end of theory in science?” EMBO Reports 10(6): 534. DOI: 10.1038/embor.2009.111

Amoore, Louise and Volha Piotukh (2015) “Life beyond big data: governing with little analytics” Economy and Society 44(3) DOI: 10.1080/03085147.2015.1043793

Boellstorff, Tom (2013) “Making big data, in theory” First Monday, 18(10)

Desrosières, Alain (1991) “How to Make Things Which Hold Together: Social Science, Statistics and the State” in Peter Wagner et al. (eds.) Discourses on Society: The Shaping of the Social Science Disciplines Springer

UNIT 1 – AN ARCHAEOLOGY OF DATA ———————————————————

For this unit, groups will have to select a platform/organization of interest. This may be Facebook, Twitter, Yelp, the US Congress, or any other organization that makes text available through digital means. The objective of this unit is to understand how the production of data within these platforms/organizations evolved, changing the ‘rawness’ of data over time. For the purpose of our discussions, I will focus on Google and its algorithmic history.

Empirically, groups will be tasked with analyzing temporal changes in the quality of data produced by different platforms. Are tweets produced in 2007 similar to those produced today? How do they vary? How are they similar?

Relevant Computer Science Techniques: This unit uses a combination of skills. Students will be expected to 1) scrape textual data using Python (e.g. BeautifulSoup); 2) identify relevant ngrams; and 3) implement unsupervised classification algorithms (such as LDA-based topic models).
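To give a flavor of steps 1 and 2, here is a minimal Python sketch. It uses only the standard library (the HTMLParser class stands in for BeautifulSoup, which we will use in class, and the toy page stands in for a real scraped document); tokenization is deliberately crude.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of a page (BeautifulSoup's get_text() does this more robustly)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

def ngram_counts(text, n=2):
    """Tokenize crudely and count n-grams (bigrams by default)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(*(tokens[i:] for i in range(n))))

# A toy page standing in for a scraped tweet, review, or press release:
page = "<html><body><p>Big data is big. Big data is everywhere.</p></body></html>"
counts = ngram_counts(extract_text(page))
# counts[("big", "data")] == 2
```

For step 3, off-the-shelf LDA implementations (e.g. in gensim or scikit-learn) can be run over the resulting token counts; we will walk through that in the practical sessions.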

WEEK 2 – Data as organizational technique

*Fourcade, Marion and Kieran Healy (2013) “Classification Situations: Life-chances in the Neoliberal Era.” Accounting, Organizations, and Society 38: 559–572.

*Rosenberg, Daniel (2013) “Data before the fact” in Gitelman, Lisa (ed.) “Raw Data” Is an Oxymoron MIT Press: Cambridge, MA.

*Orlikowski, Wanda J. and Susan V. Scott (2014) “What happens when evaluation goes online? Exploring apparatuses of valuation in the travel sector” Organization Science 25(3): 868-891.

Rieder, Bernhard (2012) “What is PageRank? A historical and conceptual investigation of a recursive status index” Computational Culture Available at:

Hillis, Ken, Michael Petit and Kylie Jarrett (2013) Google and the Culture of Search Routledge: London.

Google algorithm changes available at:

Introna, Lucas (2015) “Algorithms, governance and governmentality” Science, Technology, & Human Values

Mackenzie, Adrian (2005) “The performativity of code: software and cultures of circulation” Theory, Culture & Society 22(1)


WEEK 3

Readings will be determined on the basis of the discussions held during the previous week.

WEEK 4 – Presentations

UNIT 2 – THE LIVES OF OTHERS —————————————————————

In this unit, working groups will emulate archaeologists preoccupied with the history of individual actors and their contexts. In particular, groups will be tasked with reconstructing the preferences and/or trajectories of the actors that produce data in their platforms of choice by studying such things as changes in textual patterns, registers of distinction (cf. Bourdieu), and/or changes in political worldviews (which might be relevant for thinking about the 2016 election). This will require working with a theoretically sound hypothesis about actors and their worlds, but it will also involve several data collection and data processing skills: in addition to scraping multiple sources of semi-structured data, groups will have to code heterogeneously collected data points either through automated techniques (e.g. sentiment analysis) or the use of crowd-sourced human coders (e.g. Mechanical Turk).

Relevant Computer Science Techniques: Building on the previous unit’s exercise, students will have to 1) scrape information off an online bibliographic reference repository; 2) implement algorithms that identify cases within the collected data; and 3) identify commonalities across the cases by implementing a clustering algorithm (such as k-means).
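The clustering in step 3 can be illustrated with a bare-bones k-means written in pure Python; in practice we would run an off-the-shelf implementation (e.g. scikit-learn) on feature vectors derived from the coded data. The points below are hypothetical two-dimensional stand-ins for real cases.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: repeatedly assign each point to its nearest
    centroid, then recompute each centroid as its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Recompute centroids; keep the old one if a cluster emptied out.
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Two well-separated "cases" the algorithm should recover as clusters:
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

Note that k-means requires choosing k in advance and is sensitive to initialization; for real data we will discuss how to pick k and how to validate the resulting clusters against theory.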


WEEK 5

*Bourdieu, Pierre (2010) Distinction Routledge: London (extracts)

*Lawson, A., L. Ferrer, W. Wang and J. Murray (2015) “Detection of Demographics and Identity in Spontaneous Speech and Writing” in Baughman, A. et al. (eds.) Multimedia Data Mining and Analytics Springer. DOI: 10.1007/978-3-319-14998-1_9

*Colleoni, Elanor, A. Rozza, A. Arvidsson. (2014). “Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter using big data” Journal of Communication 64(2).

*McFarland, Daniel et al. (2013) “Differentiating language usage through topic models” Poetics 41(6): 607-625. [In general, this issue of Poetics is of relevance]

Grimmer, Justin and B. Stewart (2013) “Text as data: the promise and pitfalls of automatic content analysis methods for political texts” Political Analysis 21(3)

Tufekci, Zeynep. (2014). “Engineering the public: Big data, surveillance and computational politics” First Monday 19(7).

Bernstein, Basil (1960) “Language and social class” British Journal of Sociology 11(3): 271-276


WEEK 6

Readings will be determined on the basis of the discussions held during the previous week.

WEEK 7 – Presentations 

UNIT 3 – BIRDS IN AN ECHO CHAMBER? ———————————————————

Having looked at individual preferences/trajectories, we will now ask the question of whether we can recover the structure of fields and its evolution over time. In particular, we will try to understand how polarization is created through a combination of homophily and feedback within relatively closed fields/platforms.

Relevant Computer Science Techniques: In addition to scraping textual datasets, students are expected to 1) implement algorithms that construct networks from the data; 2) implement relevant classification algorithms (such as supervised or unsupervised topic models) over the collected corpus; and 3) model the evolution of semantic patterns over time and as a function of transformations in the network structure.
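As a first pass at step 1, plus a crude polarization measure of the kind the homophily readings below formalize, consider this sketch. The mention pattern, user names, and ideological labels are all hypothetical stand-ins for real platform data.

```python
import re

def mention_edges(tweets):
    """Build a directed author -> mentioned-user edge list from (author, text) pairs."""
    return [(author, m) for author, text in tweets for m in re.findall(r"@(\w+)", text)]

def homophily_index(edges, label):
    """Share of ties connecting nodes with the same label
    (here, a hypothetical left/right coding): a first, crude
    indicator of an echo chamber."""
    same = sum(1 for u, v in edges if label[u] == label[v])
    return same / len(edges)

# Toy data: five users with hypothetical ideological labels.
labels = {"a": "L", "b": "L", "c": "L", "d": "R", "e": "R"}
tweets = [("a", "agreed, @b and @c"), ("b", "@c exactly"),
          ("d", "@e well said"), ("c", "not so fast, @d")]
edges = mention_edges(tweets)
# 4 of the 5 ties stay within the same camp:
score = homophily_index(edges, labels)
```

Real analyses would, of course, infer labels from the text itself (step 2) and track how the index moves over time (step 3), rather than assuming them as done here.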


WEEK 8

Nahon, K. and Hemsley, J. (2014) “Homophily in the guise of cross-linking: political blogs and content” American Behavioral Scientist

Halberstam, Y. and B. Knight (2014) “Homophily, group size, and the diffusion of political information in social networks: evidence from Twitter” NBER Working Paper No. 20681

Boutyline, A. and R. Willer (2015) “The social structure of political echo chambers: variation in ideological homophily in online networks” Working paper, available at:

Alix Rule, Jean-Philippe Cointet, and Peter S. Bearman (2015) “Lexical shifts, substantive changes, and continuity in State of the Union discourse, 1790-2014” Proceedings of the National Academy of Sciences 112(35): 10837-10844


WEEK 9

Readings will be determined on the basis of the discussions held during the previous week.

WEEK 10 – Presentations and final wrap-up