Nicola Righetti & Paul Balluff 2024-03-28
We introduce CooRTweet
an R-Package for detecting
coordinated behavior on social media. Named after Twitter (now X), a
prototypical social media platform for coordinated message amplification
through its hashtags and trending topics affordances,
CooRTweet
is a general-purpose tool whose functionalities
apply to any social media platform and even extend beyond social media.
It enables the analysis of coordinated behavior employed by any entity
to disseminate any content (e.g., hashtags, URLs, images, messages, or
any other identifiable objects) via any media. It further opens up the
possibility of cross-platform analysis.
Coordinated behavior has been defined as “the act of making people and/or things involved in organized cooperation” (Giglietto et al. 2020, 872). Coordinated behavior on social media has been used for political astroturfing (Keller et al. 2020), spreading inappropriate content (Giglietto et al. 2020), and activism. Detecting such behavior is crucial for academic research and investigative journalism.
Software for academic research and investigative journalism has been developed in the last few years to detect coordinated behavior, such as the CooRnet R package (Giglietto et al. 2020), which detects Coordinated Link Sharing Behavior (CLSB) and Coordinated Image Sharing on Facebook and Instagram (CooRnet website).
The CooRTweet
package builds on the existing literature
on coordinated behavior and the experience of previous software to
provide an easy-to-use tool for detecting various coordinated networks.
The package is powered by data.table
(Dowle and Srinivasan
2022) which makes efficient use of memory and is considerably fast.
The package is compatible with any social media, as long as the data
set contains the required variables. It offers native support for the
Twitter Academic API V2 in JSON format and includes a simple convenience
function (prep_data
) for preparing other types of data in
the format necessary for the package.
Regarding the Twitter data gathered using the R package
academictwitteR
and its get_all_tweets
function, which simultaneously retrieves tweets and user information,
the CooRTweet
convenience function load_data
employs the leading-edge method for parsing large volumes of JSON data
in the most rapid manner achievable (Eddelbuettel, Knapp, and Lemire
2023).
An action \(a\) on social media can be formalized as an account \(u\) posting content \(p\) at time \(t\):
\[a = (p, t)\]
Following the standard operationalization in literature, two or more accounts are defined as coordinated when they perform the same action at least \(r\) times, within a predefined time interval \(\tau\). This so-called “same action” can be operationalized in a variety of ways:
In CooRTweet
we refer to the content on which we track
the “same actions” as objects. In turn, each object constitutes
a potentially coordinated action, which means that all potentially
coordinated actions \(A\) are a set of
unique objects: \(A = \{o_1, o_2, \ldots,
o_n\}\).
Formally, two accounts \(u_1\) and \(u_2\) are coordinated when their posts \(p_1\) and \(p_2\) contain the same object \(o\) and the time interval \(\Delta t = |t_1 - t_2|\) is smaller than \(\tau\): \(\Delta t \le \tau\).
We group all posts according to all uniquely identifiable actions. \(n(A) = N\) is the total number of potentially coordinated groups. For example, if your dataset has 100 unique URLs then one URL is a object \(o_i\) and \(n(A) = N = 100\).
Coordination detection in CooRTweet is executed through two
sequential steps, facilitated by the functions
detect_groups
and
generate_coordinated_network
.
The detect_groups()
function enables the identification
of accounts who shared the same objects (denoted as
object_id
) within a predefined time interval,
time_window
(represented by \(\tau\)). Additionally, the function
includes a parameter, min_participation
that ensures that
only accounts with a minimum level of activity in the original dataset
are included in the subsequent analysis.
This function returns a data.table object, which is subsequently
processed by the generate_coordinated_network
function.
This function completes the final stage of coordinated analysis. It
involves filtering accounts who performed identical actions within the
same timeframe, in accordance with the degree of repetition. The
underlying assumption is that two accounts may coincidentally share the
same objects within the same time window; however, the likelihood of
them repeatedly sharing the same object within the same time window is
considerably lower (Giglietto et al. 2020). The degree metric serves to
operationalize the concept of repetition. Furthermore, the function
computes an edge_symmetry_score
, which aids in evaluating
the impact of the number of shares contributed by each user on the
edge.
Based on these two functions, CooRTweet
identifies
coordinated actors and networks. Further information is provided in the
function’s documentation.
We provide an anonymized version of a real dataset of coordinated tweets by pro-government accounts in Russia (Kulichkina, Righetti, and Waldherr 2022). You can load the sample dataset as follows:
The dataset has four columns which is the minimum required input data for detecting coordinated behavior:
object_id
: the coordinated content. In this example the
ID of retweeted content.account_id
: the unique ID of a twitter account.content_id
: the unique IDs of the twitter accounts’
posts.timestamp_share
: the exact time content_id
was posted by the user.The length of content_id
should be the same as the
number of rows of your input data
Let’s assume that we want to detect coordinated behavior with a
min_participation
of 2 shares and a
time_window
of 600 seconds. We can call the first function
detect_groups()
as follows:
The result
is a data.table
that only
includes the accounts and their contents that were shared within the
given parameters. The result
is in a wide-format, where it
shows the time difference (time_delta
) between two posts
(content_id
and content_id_y
).
result
is sorted in such a way that the “older” posts are
represented by content_id
and the “newer” posts by
content_id_y
. For example, if User A retweets a post of
User B, then the Tweet by User A is the “newer” post. Sorting the
result
this way has the advantage that the direction of
cascaded coordination can be tracked.
We set the minimum participation filter to 2 to ensure that only accounts that have contributed at least two pieces of content in the activity under scrutiny are included in subsequent analyses.
combined_accounts <- c(result$account_id, result$account_id_y)
combined_accounts_dt <- data.table::data.table(account_id = combined_accounts)
account_counts <- combined_accounts_dt[, .N, by = account_id]
russian_coord_tweets <- data.table::as.data.table(russian_coord_tweets)
raw_counts <- russian_coord_tweets[, .N, by = account_id]
raw_counts_included <- raw_counts[account_id %in% combined_accounts]
# min_participation
min(raw_counts_included$N)
#> [1] 2
The coordinated detection is then completed by applying the other
function. We set the “objects” option to TRUE so that the graph keeps
the list of objects shared by accounts, for later inspection via the
group_stats
function. We also set a filter on the graph
that identifies edges with a weight greater than 99% of the edges weight
in the graph. This is used to identify accounts who repeatedly share
object_id
(i.e, any type of identified content) in the same
time_window
.
The edge_weight option creates a weight_threshold vector that is 1 if the edge exceeds the threshold value, and 0 otherwise. For example, in this case, the threshold value corresponds to a minimum edge weight of 3.
library(igraph)
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
min(E(coord_graph)$weight[E(coord_graph)$weight_threshold == 1])
Edge weight is not a perfect measure in an undirected graph, as it
can be influenced by extreme values from a user. Therefore, an
equilibrium measure, balancing the contributions of each of the two
nodes on every edge, is concurrently computed. This measure, called
edge_symmetry_score
, equals 1 when the contribution is
perfectly even and approaches zero in other cases.
We can quickly get some summary statistics by using the provided
convenience functions group_stats()
and
account_stats()
. If we are interested in the content that
accounts share in a coordinated fashion, we can call
group_stats()
and pass in our igraph
object
from the generate_coordinated_network
function:
summary_groups
shows how many accounts (column
num_accounts
) participated for each unique shared object
(object_id
).
If you are interested in understanding more about the users you can
call account_stats()
:
summary_accounts <- account_stats(coord_graph = coord_graph, result = result, weight_threshold = "full")
The documentation for each function includes details and possible options).
You can focus on a narrower time window by updating the result of the
detect_group
function via the flag_speed_share
function.
result_update <- flag_speed_share(russian_coord_tweets, result, min_participation = 2, time_window = 120)
This function creates a new column marking the edges that meet the new condition.
Using special options of the
generate_coordinated_network
function, we can get the graph
of accounts who have shared content faster and whose edge are above the
threshold (subgraph = 2
). Other options allow for the
general network filtered by edge weight (subgraph = 1
) or
the subgraph whose nodes exhibit coordinated behavior in the narrowest
time window established with the flag_speed_share
function
(fast subgraph), and the vertices adjacent to their edges
(subgraph = 3
).
coord_graph_fast <-
generate_coordinated_network(
result_update,
fast_net = TRUE,
edge_weight = 0.99,
subgraph = 2
)
Any dataset can be utilized with CooRTweet, provided it includes the
necessary data. The convenience function prep_data
facilitates the creation of an appropriate data format for further
processing. Users need only to specify the columns in their dataset
corresponding to the required ones, namely, a column with the desired
object to be tracked (object_id
), the account (or user) IDs
(account_id
), the IDs of the content featuring the object
(content_id
), and the timestamps of the shares
(timestamp_share
).
prep_data <-
function(x,
object_id = NULL,
account_id = NULL,
content_id = NULL,
timestamp_share = NULL
)
If you want to use the package with your own data that you retrieved from the Twitter API (V2), we guide you here quickly through the process.
We assume that all your tweets are stored as JSON files in a
directory. You can load the JSON data with the
load_tweets_json()
and
load_twitter_users_json()
functions
# load data
raw <- load_tweets_json('path/to/data/with/jsonfiles')
users <- load_twitter_users_json('path/to/data/with/jsonfiles')
If you cannot load your Twitter data, please feel free to raise an issue in our Github repository. We are happy to help!
Twitter data is nested and difficult to handle, so we also provide a simple pre-processing function that unnests the data:
# preprocess (unnest) data
tweets <- preprocess_tweets(raw)
users <- preprocess_twitter_users(users)
The resulting tweets
is a named list, where each item is
a data.table
. The five data.table
s are:
tweets
, referenced
, urls
,
mentions
, and hashtags
. This keeps the data
sorted and avoids redundant rows.
To access the tweets you can simply use tweets$tweets
and view your dataset.
The reshape_tweets
function makes it possible to reshape
Twitter data for detecting different types of coordinated behavior. The
parameter intent
of this function permits to choose between
different options: retweets
, for coordinated retweeting
behavior; hashtags
, for coordinated usage of hashtags;
urls
to detect coordinated link sharing behavior;
urls_domain
to detect coordinated link sharing behavior at
the domain level.
# reshape data
retweets <- reshape_tweets(tweets, intent = "retweets")
# detect coordinated tweets
result <- detect_groups(retweets, time_window = 60, min_participation = 10)
coord_graph <- generate_coordinated_network(result, edge_weight = 0.95)
hashtags <- reshape_tweets(tweets, intent = "hashtags")
result <- detect_groups(hashtags, time_window = 60, min_participation = 10)
coord_graph <- generate_coordinated_network(result, edge_weight = 0.95)
urls <- reshape_tweets(tweets, intent = "urls")
result <- detect_groups(urls, time_window = 60, min_participation = 10)
coord_graph <- generate_coordinated_network(result, edge_weight = 0.95)
urls <- reshape_tweets(tweets, intent = "urls_domain")
result <- detect_groups(urls, time_window = 60, min_participation = 10)
coord_graph <- generate_coordinated_network(result, edge_weight = 0.95)
There are two functions that give summaries of the
igraph
data resulting from the
generate_coordinated_network
function:
group_stats()
and account_stats()
.
To get insights on the objects shared in the network (groups), use
group_stats()
.Depending on whether you want statistics for
the general network, or for the fastest network if it has been computed
via the flag_speed_share
function, you can specify “fast”
or “full” in the “network” argument.
It returns a data.table
which shows the group statistics
for total count of unique accounts that shared that object.
If you are interested in the account statistics, you can pass the
igraph
resulting from
generate_coordinated_network
into
account_stats()
. Depending on whether you want statistics
for the general network, or for the fastest network, if it has been
calculated via the flag_speed_share
function, you can
spefic “fast,” or you need to specify “full,” or “none,” in the
“weight_threshold” argument.
summary_accounts <- account_stats(coord_graph = coord_graph, result = result, weight_threshold = "fast")
It provides summary statistics for each account in the network: total
coordinated posts shared (content_id
), and average time
delta (more specifically, this value represents the average of the mean
time_delta values of each account). High number of posts shared and low
average time delta might suggest highly coordinated (and potentially
automated) account behavior.
Dowle, Matt, and Arun Srinivasan. 2022. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.
Eddelbuettel, Dirk, Brendan Knapp, and Daniel Lemire. 2023. RcppSimdJson: ’Rcpp’ Bindings for the ’Simdjson’ Header-Only Library for ’JSON’ Parsing. https://CRAN.R-project.org/package=RcppSimdJson.
Giglietto, Fabio, Nicola Righetti, Luca Rossi, and Giada Marino. 2020. “It Takes a Village to Manipulate the Media: Coordinated Link Sharing Behavior During 2018 and 2019 Italian Elections.” Information, Communication & Society 23 (6): 867–91. https://doi.org/10.1080/1369118X.2020.1739732.
Keller, Franziska B., David Schoch, Sebastian Stier, and JungHwan Yang. 2020. “Political Astroturfing on Twitter: How to Coordinate a Disinformation Campaign.” Political Communication 37 (2): 256–80. https://doi.org/10.1080/10584609.2019.1661888.
Kulichkina, Aytalina, Nicola Righetti, and Annie Waldherr. 2022. “Pro-Democracy and Pro-Regime Coordination in Russian Protests: The Role of Social Media.” In 72nd Annual ICA Conference, One World, One Network‽.