GLOBAL CONTENTIOUS POLITICS DATASET
Global Contentious Politics Dataset (GLOCON) is the first comparable protest event database on emerging markets using local news sources. GLOCON counts the number of events such as strikes, rallies, boycotts, protests, riots, and demonstrations, i.e. the “repertoire of contention,” and operationalizes protest events by various social groups by including (i) spontaneous or organized protests, and (ii) protests led by organizations (political, ethnic, religious, or criminal) independent of the location of the protest event.
GLOCON (Global Contentious Politics Dataset) is one of the major results of a broader project called Emerging Welfare (EMW). Funded by a European Research Council (ERC) Starting Grant, EMW has been led by Assoc. Prof. Dr. Erdem Yörük of Koç University since January 2017.
Emerging Welfare is a multimethod and interdisciplinary comparative welfare research project, investigating the politics of contemporary welfare state development in emerging market economies. The project is structured around two main research questions:
- First, are we observing a new global welfare state regime structure as a result of this welfare state expansion in emerging markets?
- And, second, what is the cause of this welfare state expansion in emerging markets? Specifically, what is the role of contentious politics?
To answer these questions, we have built two separate databases by employing quantitative, computational, qualitative and comparative methods: GLOW and GLOCON. Our first dataset, the Global Welfare Dataset (GLOW), focuses on welfare indicators, while the second, the Global Contentious Politics Dataset (GLOCON), focuses on political protests. You can find more details about GLOCON in the sections below.
Global Contentious Politics Dataset (GLOCON) is the first comparable protest event database on emerging markets using local news sources. Creating the database depends on automated text processing tools that detect whether a news article contains a protest event, locate protest information within the article, and extract pieces of information regarding the detected protest events. The only exception to the use of automated text processing tools in GLOCON is Türkiye.
We define a protest event as “a collective public action by a non-governmental actor who expresses criticism or dissent and articulates a societal or political demand” (Rucht et al. 1999, 68). The EMW Project will count the number of events such as strikes, rallies, boycotts, protests, riots, and demonstrations, i.e. the “repertoire of contention” (Tarrow 1994, Tilly 1984). It will also indicate the participants, organizers, targets, location, city, and facility, as well as the violence, urban/rural, ethnicity, religion, ideology and caste characteristics of the event. The EMW project does not intend to produce an exhaustive count of all, or even most, incidents of contentious politics, since newspapers report on only a fraction of the events that occur (Davenport 2009, Earl et al. 2004, Ortiz et al. 2005). The assumption is that during times of strong social movements, newspapers report social events more than usual (Silver 2003). Therefore, the database will count every report of an event, so that events can be differentiated in terms of their importance. It intends to create a measure of the changing levels of grassroots political events over time and space during the welfare transformation.
The case countries of GLOCON are select emerging markets, including China, India, South Africa, Mexico, Brazil and Türkiye.
VARIABLES / CONTENT:
The information types and labels in GLOCON can be found below. For detailed information, please check the Annotation Manual.
INFORMATION TYPES AND LABELS
Title of the document
Document publication time
Document publication place
Type of the Event
Time of the event
Place of the event
The facility in which the event takes place
Centrality of the event location (Urban or rural)
Semantic category of events
Types of participants
Names of participants
Ideology of the Participants
Ethnic identity of participants
Religious Identity of Participants
Caste of the Participants
Socioeconomic Status of the Participants
Distinguishing between organizers and participants
The event organizer type
The Name of the Event Organizer
The Ideology of the Event Organizer
The Ethnicity of the Event Organizer
The Religion of the Event Organizer
Caste of the Event Organizer
Socioeconomic Status of the Organizer
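To make the list above concrete, here is a minimal sketch of how one event record with these information types could be represented. The class and field names are assumptions chosen for illustration; they are not the actual GLOCON schema, which is defined in the Annotation Manual.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Actor:
    """A participant or an organizer (hypothetical field names)."""
    type: Optional[str] = None
    name: Optional[str] = None
    ideology: Optional[str] = None
    ethnicity: Optional[str] = None
    religion: Optional[str] = None
    caste: Optional[str] = None
    socioeconomic_status: Optional[str] = None


@dataclass
class ProtestEventRecord:
    """One protest event, mirroring the information types listed above."""
    # Document-level information
    document_title: str
    document_publication_time: Optional[str] = None
    document_publication_place: Optional[str] = None
    # Event-level information
    event_type: Optional[str] = None
    event_time: Optional[str] = None
    event_place: Optional[str] = None
    facility: Optional[str] = None
    urban_or_rural: Optional[str] = None
    semantic_category: Optional[str] = None
    # Organizers are kept apart from participants, as in the list above
    participants: List[Actor] = field(default_factory=list)
    organizers: List[Actor] = field(default_factory=list)


# Invented example values, for illustration only.
record = ProtestEventRecord(
    document_title="Hypothetical headline",
    event_type="strike",
    urban_or_rural="urban",
    participants=[Actor(type="workers")],
)
flat = asdict(record)  # nested dict, convenient for export to JSON
```

Keeping participants and organizers as separate lists of the same `Actor` type reflects the distinction the list draws between the two roles while letting both carry the same attributes.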
Our computer science team has been building software that automatically generates GLOCON, the first comparative protest events database for emerging markets. Here, we employ advanced computing techniques such as artificial intelligence, natural language processing, and machine learning to extract protest data from online news sources. We develop fully automated tools for document classification, sentence classification, and detailed protest event information extraction that will perform in a multi-source, multi-context protest event setting with consistent recall and precision in each country context.
Our computer algorithm has two parts:
1- The classification part identifies all protest-related news in online archives.
2- The extraction part then identifies relevant characteristics of protests, such as the type of the event, the organizers and location of the event, and the ideology, ethnicity, and religion of protest participants.
In short, the machine learning system is based on the principle that the computer can imitate what humans do. In our system, human annotators tag certain protest characteristics in news articles. Once a large number of annotations has been created, forming what is called a corpus, the system has enough training data to imitate the human reasoning behind event coding. Currently, 15 researchers, including computer scientists and social scientist annotators, are developing this work package.
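The two-part structure described above can be sketched as a small pipeline in which extraction runs only on articles that the classifier accepts. This is an illustrative sketch, not the project's code: `run_pipeline`, `toy_is_protest` and `toy_extract` are hypothetical names, and the toy functions are trivial stand-ins for the trained models, which learn from annotated data rather than matching a word.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: any trained model could be wrapped to fit them.
Classifier = Callable[[str], bool]
Extractor = Callable[[str], Dict[str, str]]


def run_pipeline(articles: List[str],
                 is_protest: Classifier,
                 extract: Extractor) -> List[Dict[str, str]]:
    """Part 1 filters protest-related news; part 2 extracts event details."""
    events = []
    for text in articles:
        if is_protest(text):              # classification part
            events.append(extract(text))  # extraction part
    return events


def toy_is_protest(text: str) -> bool:
    # Stand-in only: a real classifier is trained on the annotated corpus.
    return "strike" in text.lower()


def toy_extract(text: str) -> Dict[str, str]:
    # Stand-in only: a real extractor labels event type, actors, place, etc.
    return {"event_type": "strike", "source_text": text}


articles = [
    "Workers went on strike at the port.",
    "The city opened a new library.",
]
events = run_pipeline(articles, toy_is_protest, toy_extract)
```

Passing the two parts in as functions keeps the pipeline independent of how each model is implemented, so either part can be retrained or replaced separately.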
WHY AUTOMATED PROTEST EVENT COLLECTION & CONTRIBUTION OF OUR METHOD TO THE STATE-OF-THE-ART
Automated protest event extraction promises to overcome the prohibitive costs of human coding of protest events from a variety of sources, countries, and time periods in order to create large and comprehensive datasets of protest events.
In order to obtain a generalizable and fully automated system that extracts contentious political events from a variety of online news sources, we have developed a novel bottom-up methodology that is based on random sampling of news archives, as opposed to keyword filtering. We use state-of-the-art information retrieval methods, natural language processing, and deep learning algorithms with a focus on gold-standard corpus (GSC) creation using multiple context evaluation. The high-quality GSC is designed to accommodate context variability from the outset, as it is compiled randomly from a variety of news sources from different countries. Trained graduate students of social sciences annotate randomly sampled documents, a step that forms the basis of the training, test, and evaluation data of the final, fully automated tools. The bottom-up approach dictates starting from unfiltered random samples of news articles from each of our focus countries.
We develop fully automated tools for document classification, sentence classification, and detailed protest event information extraction that will perform in a multi-source, multi-context protest event setting with consistent recall and precision in each country context. In order to cope with the challenges of developing generalizable tools that can handle source heterogeneity, we designed the tool development process to incorporate sources from multiple contexts. However, rather than developing tools from scratch for every context, the main aspect of our task design assigns different contexts to the training, test and evaluation steps of the development cycle. The steps we followed are listed below:
1. DEFINING ‘PROTEST’
We define a protest event as “a collective public action by a non-governmental actor who expresses criticism or dissent and articulates a societal or political demand” (Rucht et al. 1999, 68). The EMW Project will count the number of events such as strikes, rallies, boycotts, protests, riots, and demonstrations, i.e. the “repertoire of contention” (Tarrow 1994, Tilly 1984). It will also indicate the participants, organizers, targets, location, city, facility as well as violence, urban/rural, ethnicity, religion, ideology and caste characteristics of the event.
2. CREATING A GOLD STANDARD CORPUS
This process of automated tool development depends on a high-quality, human-annotated gold standard corpus (GSC) of documents, which we call GLOCON Gold, as the basis of the training-evaluation cycle for each of the text processing tasks: text classification, sentence detection, and information extraction. It currently contains more than 17,000 news articles from local sources, selected using either random or active-learning-based sampling: in English from India, China, and South Africa; in Spanish from Argentina; and in Portuguese from Brazil. Additional random samples from international news sources containing news about China are also included so that the corpus contains both country and source variability. All of these documents are labeled as protest or non-protest at the document level. In the sentence classification task, all sentences in protest-related articles, of which there are more than 1,000, are labeled as containing event information or not. Finally, the positively annotated protest-related sentences are annotated at the token level for the extraction of detailed event information on characteristics such as the event type, protest category of the event, event trigger co-reference, and the event’s place, time, and actors (participants, organizers and targets). The corpus has been used to create a pipeline of deep-learning-based machine learning (ML) models. The variety of sources has allowed us to study and improve the cross-context robustness and generalizability of the ML models.
If you would like to use GLOCON Gold, please contact us for details.
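As a rough illustration of the three annotation levels in the corpus (document, sentence, token), the sketch below nests them and checks that each level only appears where the previous level was positive. The class names, field names and tag strings are assumptions made for this example; they do not describe the actual GLOCON Gold file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TokenAnnotation:
    token: str
    label: str  # hypothetical tags, e.g. "trigger", "place", "participant"


@dataclass
class SentenceAnnotation:
    text: str
    has_event: bool  # sentence-level label
    tokens: List[TokenAnnotation] = field(default_factory=list)


@dataclass
class DocumentAnnotation:
    text: str
    country: str
    is_protest: bool  # document-level label
    sentences: List[SentenceAnnotation] = field(default_factory=list)


def check_levels(doc: DocumentAnnotation) -> List[str]:
    """Return a list of places where a level was annotated out of order."""
    problems = []
    if not doc.is_protest and doc.sentences:
        problems.append("sentence labels on a non-protest document")
    for sentence in doc.sentences:
        if not sentence.has_event and sentence.tokens:
            problems.append(
                f"token labels on a non-event sentence: {sentence.text!r}")
    return problems


# Invented example documents, for illustration only.
good = DocumentAnnotation(
    text="Farmers marched to the capital. Traffic was slow.",
    country="India",
    is_protest=True,
    sentences=[
        SentenceAnnotation(
            "Farmers marched to the capital.", True,
            [TokenAnnotation("Farmers", "participant"),
             TokenAnnotation("marched", "trigger"),
             TokenAnnotation("capital", "place")]),
        SentenceAnnotation("Traffic was slow.", False),
    ],
)
bad = DocumentAnnotation(
    text="Rates were unchanged.",
    country="Brazil",
    is_protest=False,
    sentences=[SentenceAnnotation("Rates were unchanged.", False)],
)
good_problems = check_levels(good)
bad_problems = check_levels(bad)
```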
3. SECURING COMPLETENESS, VALIDITY AND CONSISTENCY
Our corpus design aims at securing completeness and validity through a variety of principles and mechanisms. In order to secure the utmost completeness, we start with unfiltered random samples of documents compiled from every news source, so that country- and source-specific characteristics of protest events are incorporated from scratch. The random sampling approach makes the task challenging but closer to reality. Keyword lists tend to miss certain protest events of a common type when they are indirectly or less explicitly mentioned (e.g. “workers stopped working” referring to a strike) (Hürriyetoğlu et al., 2019). They might even exclude a whole class of events due to the lexical variance across contexts when referring to particular event types (e.g. “dharna” and “bandh”, referring to sit-in and citizen strike respectively in India). In order to take full advantage of random sampling in terms of completeness, a domain expert in the politics of the target country gives our annotators country-specific training on contentious politics, for a better understanding of the dynamics of social movements and of the peculiar and/or prevalent types of protest that are specific to each country. This enables them to better recognize protest events even when reporting is indirect or less explicit, and when they encounter local event characteristics that might not be obvious to someone who is not familiar with that context.
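The difference between keyword filtering and random sampling can be illustrated with a toy archive. The sentences and the keyword list below are invented for this example and are not taken from the corpus; the point is only that a keyword filter silently drops the indirectly worded and locally named events, while a random sample gives every document the same chance of reaching the annotators.

```python
import random
from typing import List, Sequence

# Invented toy archive: the first four items describe protest events.
archive = [
    "Workers stopped working at the textile plant over unpaid wages.",
    "Opposition parties called a bandh across the state.",
    "Farmers staged a dharna outside the district office.",
    "Thousands joined a protest march in the capital.",
    "The central bank kept interest rates unchanged.",
]
protest_docs = archive[:4]

# Hypothetical keyword list of the kind a filter-based approach might use.
KEYWORDS = ("protest", "strike", "rally", "riot", "demonstration", "boycott")


def keyword_filter(docs: Sequence[str],
                   keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Keep only documents that contain at least one keyword."""
    return [d for d in docs if any(k in d.lower() for k in keywords)]


def random_sample(docs: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Draw k documents uniformly at random, ignoring their wording."""
    rng = random.Random(seed)
    return rng.sample(list(docs), k)


kept = keyword_filter(archive)
missed = [d for d in protest_docs if d not in kept]  # events the filter lost
sample = random_sample(archive, 3)
```

Here the filter keeps only the article that uses the word "protest" and misses the work stoppage, the bandh and the dharna, which is the failure mode described above.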
To maintain high levels of validity and consistency, the quality of annotations is checked through spot-checks (10% of annotator agreements are checked by the annotation supervisor for mistakes) and through a cross-task feedback mechanism that is enabled by our process design. The three levels of annotation are separate but integrated in the sense that they form a pipeline in which a single document goes through each individual step, and each step is built upon the result of the previous step. The aim here is to maximize time and resource efficiency and performance by utilizing the feedback of each level of annotation for the whole process. This, in turn, enables error analysis and optimization during annotation and tool development efforts, which improves inter-annotator agreement (IAA) and reduces the time spent on quality checks. What is more, the human-annotated GSC compiled from unfiltered random samples in this bottom-up and interlinked work process allows us to measure the markers of completeness and validity transparently and consistently. Human annotation and automated tool performances are constantly compared and checked against one another, and all data in the GSC are optimized in multiple steps of checks, error corrections and fine-tuning.
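The two quality measures mentioned above, a 10% spot-check and inter-annotator agreement, can be sketched as follows. This is a hedged illustration: the text does not say which agreement statistic the project uses, so Cohen's kappa is an assumption chosen here because it is a common measure for two annotators, and the function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def spot_check_sample(items: Sequence, fraction: float = 0.10,
                      seed: int = 0) -> List:
    """Pick a random fraction of annotated items for supervisor review."""
    rng = random.Random(seed)
    k = max(1, round(len(items) * fraction))
    return rng.sample(list(items), k)


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: kappa is used here only as an example agreement measure.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for four documents, for illustration only.
annotator_a = ["protest", "protest", "non-protest", "non-protest"]
annotator_b = ["protest", "non-protest", "non-protest", "non-protest"]
kappa = cohen_kappa(annotator_a, annotator_b)

to_review = spot_check_sample(list(range(50)))  # 10% of 50 items
```

In this toy case the annotators agree on three of four documents, and half of that agreement is expected by chance, so kappa comes out at 0.5.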
PUBLICATIONS BY THE TEAM:
“Cross-Context News Corpus for Protest Event-Related Knowledge-Base Construction”, Data Intelligence, 3(2), 308–335. Hürriyetoğlu, A., Yörük, E., Mutlu, O., Duruşan, F., Yoltar, Ç., Yüret, D., & Gürel, B. (2021). doi:10.1162/dint_a_00092
“Random Sampling in Corpus Design: Cross-Context Generalizability in Automated Multicountry Protest Event Collection”, American Behavioral Scientist, 0(0), 00027642211021630. Yörük, E., Hürriyetoğlu, A., Yoltar, Ç., & Duruşan, F. (2021). doi:10.1177/00027642211021630. eprint: https://doi.org/10.1177/00027642211021630
“Automated extraction of socio-political events from news (AESPEN): Workshop and shared task report”, Proceedings of the Workshop on Automated Extraction of Socio-political Events from News (AESPEN), 2020. (A. Hürriyetoğlu, V. Zavarella, H. Tanev, E. Yörük, A. Safaya and Osman Mutlu)
“Analyzing ELMo and DistilBERT on Socio-political News Classification”, Proceedings of the Workshop on Automated Extraction of Socio-political Events from News (AESPEN), 2020. (B. Büyüköz, A. Hürriyetoğlu and A. Özgür)
“Overview of CLEF 2019 Lab ProtestNews: Extracting Protests from News in a Cross-context setting”, in F. Crestani, et al. ed. Experimental IR Meets Multilinguality, Multimodality, and Interaction, Cham. Springer International Publishing, 2019. (Ali Hürriyetoğlu, Erdem Yörük, Deniz Yuret, Çağrı Yoltar, Burak Gürel, Fırat Duruşan, Osman Mutlu and Arda Akdemir).
“A Task Set Proposal for Automatic Protest Information Collection Across Multiple Countries”, European Conference on Information Retrieval, 2019. (Erdem Yörük, Deniz Yüret, Çağrı Yoltar, Burak Gürel, Fırat Duruşan and Osman Mutlu)
“Towards Generalizable Place Name Recognition Systems: Analysis and Enhancement of NER Systems on English News from India”, 12th Workshop on Geographic Information Retrieval, November 6, 2018, Seattle, WA, USA. (Arda Akdemir, Ali Hürriyetoğlu, Erdem Yörük, Burak Gürel, Çağrı Yoltar and Deniz Yüret) https://doi.org/10.1145/3281354.3281363