METHODOLOGY AND APPROACH

The purpose of the second work package of the EMW Project was to draw a global map of contentious politics. GLOCON, the first comparative contentious politics event database for countries of the global south, was created for this purpose. The dataset contains information extracted from news reports featured in the most prominent online sources accessible at the time of preparation. GLOCON records contentious politics events (referred to as protest events for brevity) that take place within the borders of our focus countries, together with all the information available in the source about the time, place, and type of the event and who organized and participated in it. At present, the GLOCON database contains protest event data from Turkey, India, South Africa, Argentina, and Brazil, collected from sources in four languages: Turkish for Turkey, English for India and South Africa, Spanish for Argentina, and Portuguese for Brazil. The database was designed to accommodate additional focus countries and/or news sources in the future.

GLOCON protest event data, except for the data on Turkey, which were coded manually, were prepared using advanced computing techniques such as artificial intelligence, natural language processing (NLP), and machine learning that automatically extract protest data from online news sources. We developed fully automated tools for document classification, sentence classification, and detailed protest event information extraction that perform in a multi-source, multi-context, and multilingual protest event setting with consistent recall and precision markers for each country context.

Our computer algorithm has three parts (a simplified code sketch follows the list):

1- For protest event information extraction, we employ a multi-task model that combines protest detection at multiple granularities to achieve higher performance on the task (Mutlu, 2022). This step extracts one or more event candidates (possible events) and identifies their relevant characteristics, such as the organizer(s), target(s), participant(s), trigger, and location(s) of the event.

2- Additional information about events is discovered with multiple document and sentence classification models. This step determines the semantic types of events, organizers, and participants, as well as whether an event is violent and whether it takes place in an urban or rural setting.

3- Finally, candidate events are geolocated; events that have no place name or that fall outside the country of the source used are discarded.
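
To make the flow of these three stages concrete, below is a minimal sketch in Python. The class, method, and key names (EventCandidate, extractor.extract, geocoder.resolve, the classifier keys) are hypothetical stand-ins for illustration, not the project's actual interfaces.

```python
# Minimal sketch of the three-stage pipeline described above. The model
# objects passed in (extractor, classifiers, geocoder) are hypothetical
# stand-ins, not GLOCON's actual components.
from dataclasses import dataclass, field

@dataclass
class EventCandidate:
    trigger: str
    organizers: list = field(default_factory=list)
    participants: list = field(default_factory=list)
    targets: list = field(default_factory=list)
    places: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)

def run_pipeline(article_text, country, extractor, classifiers, geocoder):
    """Apply the three stages to a single news article."""
    # Stage 1: multi-task extraction of event candidates and their arguments.
    candidates = extractor.extract(article_text)

    # Stage 2: enrich each candidate with semantic-type and binary labels.
    for cand in candidates:
        cand.attributes["event_type"] = classifiers["event_type"].predict(cand, article_text)
        cand.attributes["violent"] = classifiers["violence"].predict(cand, article_text)
        cand.attributes["urban"] = classifiers["urban_rural"].predict(cand, article_text)

    # Stage 3: geolocate; drop events without a resolvable in-country place.
    kept = []
    for cand in candidates:
        coords = geocoder.resolve(cand.places, country)
        if coords is not None:
            cand.attributes["coordinates"] = coords
            kept.append(cand)
    return kept
```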

Our machine learning system is based on the principle that a computer can learn to imitate what humans do. A human-annotated corpus serves as the training, test, and evaluation data from which the system learns to imitate the human reasoning behind event coding. The scope and quality of this corpus (that is, the variety it incorporates in terms of countries, languages, and sources, and the consistency of the coding of every type of event-related information across this diversity) largely determine the accuracy and completeness of the resulting automated system. Hence, it is referred to as the gold standard corpus (GSC). Creating the GSC has taken the major part of the time and effort that went into building GLOCON. Skilled annotators, all graduate students in the social sciences, annotated over 17,000 documents under the supervision of, and according to the annotation manual prepared by, domain experts.
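
As a simplified illustration of this training principle, the sketch below fits a document-level protest classifier on human-annotated data and reports the precision and recall markers mentioned above. The TF-IDF and logistic regression combination is a lightweight stand-in chosen for brevity; GLOCON's actual models are deep learning-based, and all variable names are illustrative.

```python
# Illustrative sketch: training and evaluating a document-level protest
# classifier on a human-annotated corpus. The simple TF-IDF + logistic
# regression model is a stand-in; GLOCON's real models are deep networks.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

def train_and_evaluate(texts, labels):
    # Hold out a test portion of the gold standard corpus for evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels)

    vectorizer = TfidfVectorizer(max_features=50_000)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(X_train), y_train)

    # Precision and recall are the performance markers the system is judged by.
    preds = clf.predict(vectorizer.transform(X_test))
    return precision_score(y_test, preds), recall_score(y_test, preds)
```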

 
What is a protest event and who is a protester?

The GLOCON Project creates a database of protest events drawn from a wide spectrum of collective action forms, what Sidney Tarrow (1994) and Charles Tilly (1984) refer to as the “repertoire of contention”. We define protest as collective acts of non-governmental actors that aim to criticize or oppose the political, economic, or social order, voice societal or political demands or grievances, or engage in political conflict with other actors. We identify five major types of protest and classify real-world events into them: demonstrations, industrial actions, group clashes, armed militancy, and electoral politics. Protest, defined as such, covers a wide variety of events, including marches, sit-ins, rallies, strikes, boycotts, riots, clashes, bombings and other violent assaults against citizens, government institutions, or security forces, petition campaigns, hunger strikes, and many more.
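
For illustration, this five-way typology could be represented as a simple enumeration; the identifier names below are our own rendering, not the project's actual code.

```python
# Illustrative enumeration of the five major protest types defined above.
from enum import Enum

class ProtestType(Enum):
    DEMONSTRATION = "demonstration"
    INDUSTRIAL_ACTION = "industrial action"
    GROUP_CLASH = "group clash"
    ARMED_MILITANCY = "armed militancy"
    ELECTORAL_POLITICS = "electoral politics"
```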

We identify actors associated with protest in three roles: organizers, who devise and/or lead the protests; participants, who take part in them; and targets, the entities that a protest is directed against, such as governments, leaders, or any other social or political entity being opposed. We also classify participants and organizers into categories: the myriad real social actors who engage in protest are sorted into ten participant classes (e.g. student, worker, peasant, activist, militant) and six organizer classes (e.g. party, union, NGO). You can find the detailed definitions of each event, participant, and organizer category, and the rules that determined their coding, in our annotation manual.
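
A hypothetical record for the three actor roles might look like the sketch below; the class and field names are ours, and the comments list only the example categories named above rather than the full sets of ten participant and six organizer classes.

```python
# Hypothetical structure for the three actor roles attached to an event.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProtestActors:
    organizers: List[str] = field(default_factory=list)    # e.g. party, union, NGO
    participants: List[str] = field(default_factory=list)  # e.g. student, worker, peasant
    targets: List[str] = field(default_factory=list)       # e.g. a government or leader
```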

 
Why have we automated the event database?

Protest event databases have been essential tools for social scientists since at least the 1970s, particularly in research on social movements. Scanning newspaper archives and compiling lists of events to build continuous time series of social contention over decades is a robust, though highly costly, research tool in terms of time and effort. Beyond these costs, there have also been methodological difficulties. One concerns maintaining consistency in coding over long periods of time and across the many researchers who do the coding. Although humans are excellent at interpreting textual information accurately, they bring personal biases to coding that are difficult to control when creating a big database, since controlling them requires almost replicating an already lengthy and arduous process. News sources also introduce bias, which is inevitable and cannot be eliminated, but the bias of each source may be balanced by that of another: a right-wing editorial bias of one paper can be mitigated by a left-wing one, the preferences of an international source that targets a global (read: western) audience can be balanced by using local sources, and so on. This, of course, adds to the already high costs of manual event coding.

Automated event coding promises to overcome these problems. Automation enables processing large amounts of content spanning long periods of time and a global geographic scale. Automated tools can also apply consistent coding principles that can be checked, corrected, and re-applied relatively easily. This has the added benefit of making the methodology easily reproducible, enabling cooperation and collaboration between different research efforts and, not least, more thorough peer review. Lastly, automated text processing can work with any machine-readable text, greatly expanding the number and variety of sources a project can draw on and thereby allowing researchers to achieve relative reliability, defined as the ability to balance source biases and represent the actual world more completely.

That being said, automated methods have their own challenges in terms of reliability. Lacking the contextual knowledge and interpretive flexibility of human coders, automated databases must take extra steps to avoid being too rigid and misleading researchers with unreliable, that is, incomplete or erroneous, information.

 
How do we secure validity and completeness?

The training process of automated text processing tools contains the most important steps in securing accuracy in information classification and extraction. The high quality of human coding can be imparted to the tools by using human-coded training data of the highest quality (coded with the utmost consistency and accuracy) in the right amount (containing enough instances of different types of events and actors to represent real-world social variability). The GLOCON project achieved this with GLOCON Gold, a corpus that at present contains 17,000 news articles, sampled mainly at random. Some documents were sampled via active learning, a method of over-sampling relevant documents based on earlier random-sample annotations. The corpus includes local sources from India, South Africa, Brazil, and Argentina in English, Portuguese, and Spanish, as well as a few international sources, to secure country and source variability. All documents are labeled at the document level as to whether they contain information on a protest event. Those that do are annotated at the sentence level to train models that detect event-related sentences. Finally, the event-related sentences are annotated at the word level for the extraction of detailed event information such as the event type, protest category, event trigger co-reference, and the event's place, time, and actors (participants, organizers, and targets). All annotation was done according to the same rules, compiled in our annotation manual, which guide every step of coding in minute detail to secure consistency.
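
The three annotation levels can be pictured with a hypothetical example document; the field names and BIO-style tags below are illustrative, not the project's exact annotation schema.

```python
# Hypothetical example of the three annotation levels on one short article.
annotated_document = {
    "doc_label": "protest",            # document level: contains a protest event?
    "sentences": [
        {
            "text": "Workers marched in Mumbai on Monday.",
            "sent_label": "event",     # sentence level: event-related?
            # word level: BIO-style tags marking participant, trigger, place, time
            "tokens": ["Workers", "marched", "in", "Mumbai", "on", "Monday", "."],
            "tags": ["B-participant", "B-trigger", "O", "B-place", "O", "B-time", "O"],
        },
        {
            "text": "The city has grown rapidly in recent years.",
            "sent_label": "non-event",
            "tokens": [],              # non-event sentences get no span annotation
            "tags": [],
        },
    ],
}
```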

Every document in the GSC was annotated by two annotators, graduate students of social sciences at Koç University in Turkey and the University of Sao Paulo in Brazil, and adjudicated by a domain expert who also maintained the annotation manual. Ten percent of the annotations were spot-checked for mistakes to secure validity and consistency, and accuracy was further boosted by a cross-task feedback mechanism enabled by our process design, whereby each consecutive level of annotation provides a check on the previous level. This, in turn, enabled error analysis and optimization during the annotation and tool development stages, improving inter-annotator agreement and reducing the time spent on quality checks. Furthermore, the human-annotated GSC, compiled from unfiltered random samples in this bottom-up and interlinked work process, has allowed us to measure markers of completeness and validity transparently and consistently. Human annotation and automated tool performance are constantly compared and checked against one another, and all data in the GSC are optimized through multiple rounds of checks, error corrections, and fine-tuning. Finally, the research team organized shared-task events in which other researchers could test and experiment with the GLOCON training data. These produced results in line with our final system and proved to be efficient tools for external review.
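
Agreement between the two annotators of each document can be quantified with a chance-corrected statistic such as Cohen's kappa; the sketch below shows one common way to compute it for document-level labels (the agreement statistic the project actually used is not specified here).

```python
# Sketch: inter-annotator agreement on parallel document-level labels,
# measured with Cohen's kappa (an assumed choice of statistic).
from sklearn.metrics import cohen_kappa_score

# 1 = "contains a protest event", 0 = "does not" (toy data)
annotator_a = [1, 0, 1, 1, 0, 1]
annotator_b = [1, 0, 1, 0, 0, 1]

print(cohen_kappa_score(annotator_a, annotator_b))  # ~0.67
```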

This high-quality GSC has been used to build a pipeline of state-of-the-art deep learning-based machine learning (ML) models, which produced the GLOCON dataset we offer on this website, in the hope that it will aid social scientists from all around the world and facilitate and improve event-based contentious politics research.