Abstract
Open and accessible data streams are crucial for reproducible research and further development. Cricket data sources are limited and are usually not in a format ready for analysis. The cricketdata R package allows users to download data as a tibble ready for analysis from two primary sources: ESPNCricinfo and Cricsheet. The fetch_cricinfo() and fetch_player_data() functions download data from ESPNCricinfo for different formats of international cricket (Tests, ODIs, T20), different activities (batting, bowling, fielding), and either whole careers or individual innings. Cricsheet is another data source, primarily for ball-by-ball data. The fetch_cricsheet() function downloads ball-by-ball, match, and player data for different competitions and formats (Tests, ODIs, T20 internationals, T20 leagues). The T20 data is further processed by deriving additional features (columns) from the raw data. Other functions provide access to individual players' career data and information about their playing style, country of origin, etc. The package essentially provides (almost) all publicly available cricket data ready for analysis, saving the user significant time in building a data pipeline. Here is an example of a project built using cricketdata: https://dazzalytics.shinyapps.io/cricwar/
The coverage of cricket as a sport has been limited compared to other global sports. ESPNCricinfo is the major, and one of the few, online platforms dedicated to cricket coverage. It started as Cricinfo in the 1990s, maintained by students and cricket fans who had immigrated to North America but were eager to keep tabs on cricket around the globe. ESPN acquired Cricinfo in 2007, and the site became ESPNCricinfo. It is the most extensive repository of open cricket data, with the caveat that the data is not in a format that can be downloaded easily: you would have to copy and paste tables or write scripts to get the data into a form suitable for analysis. They have also added a search tool, Statsguru, that lets you query their database and presents the results, usually as tables.
Cricsheet is another open data source, focused on ball-by-ball data and maintained by a great fan of the game, Stephen Rushe. Cricsheet provides raw ball-by-ball data for all formats (Tests, ODIs, T20) and for both men's and women's games. Producing ball-by-ball data is an extensive project, and we hugely appreciate Stephen Rushe's work. The data is available in different formats, such as JSON, YAML, and CSV.
cricketdata
The cricketdata (open-source) package aims to be a one-stop shop for most cricket data from all primary sources, available in an accessible form and ready for analysis. Different functions in the package download data from ESPNCricinfo and Cricsheet as a data frame (tibble) in R. The user can access data for different formats of the game, e.g. Tests, ODIs, international T20, league T20, etc. In particular, cricWAR (https://dazzalytics.shinyapps.io/cricwar/) is an example of a sports analytics project based on cricketdata resources.
cricketdata, as an open-source project, is inspired primarily by the open-source work of the R community and by sports analytics projects such as nflfastR (Carl and Baldwin, n.d.) and sportsdataverse (Gilani, n.d.).
In the following sections, we will show how to install the package and take full advantage of the package functionality with numerous examples.
cricketdata is available on CRAN, from where the stable version can be installed. You may also download the development version from GitHub.
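The two installation routes can be sketched as follows. `install.packages()` is base R; the GitHub line rests on two assumptions not stated above: that the `remotes` package is installed, and that the development repository is `robjhyndman/cricketdata`.

```r
# Stable release from CRAN
install.packages("cricketdata")

# Development version from GitHub.
# Assumptions: the remotes package is available and the repository
# lives at robjhyndman/cricketdata.
# install.packages("remotes")
remotes::install_github("robjhyndman/cricketdata")

# The examples in this document use the pipe, dplyr verbs and ggplot2,
# so these packages are attached alongside cricketdata.
library(cricketdata)
library(dplyr)
library(ggplot2)
```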
There are six main functions,
fetch_cricinfo()
find_player_id()
fetch_player_data()
fetch_cricsheet()
fetch_player_meta()
update_player_meta()
and a data file containing the player meta data.
player_meta
We show the use of each function with examples below.
fetch_cricinfo()
Fetch data on international cricket matches provided by ESPNCricinfo. The function downloads data for international T20, ODI or Test matches, for men or women, and for batting, bowling or fielding. By default, it downloads career-level statistics for individual players.
Arguments
Women’s T20 Bowling Data
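# Fetch all women's T20 bowling data (career statistics by default)
wt20 <- fetch_cricinfo("T20", "Women", "Bowling")
# Looking at data
wt20 %>%
  glimpse()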
#> Rows: 1,977
#> Columns: 16
#> $ Player <chr> "A Mohammed", "Nida Dar", "EA Perry", "M Schutt", "…
#> $ Country <chr> "West Indies", "Pakistan", "Australia", "Australia"…
#> $ Start <int> 2008, 2010, 2008, 2013, 2007, 2005, 2006, 2008, 201…
#> $ End <int> 2021, 2023, 2023, 2023, 2023, 2023, 2023, 2020, 202…
#> $ Matches <int> 117, 128, 136, 93, 109, 109, 118, 79, 89, 72, 113, …
#> $ Innings <int> 113, 121, 128, 92, 108, 108, 103, 79, 87, 72, 87, 6…
#> $ Overs <dbl> 395.3, 410.2, 392.5, 309.3, 381.5, 381.1, 302.3, 26…
#> $ Maidens <int> 6, 10, 8, 7, 20, 17, 6, 10, 11, 5, 4, 10, 7, 9, 6, …
#> $ Runs <int> 2206, 2231, 2297, 1916, 2191, 2102, 1920, 1587, 190…
#> $ Wickets <int> 125, 123, 121, 121, 117, 112, 110, 102, 100, 98, 98…
#> $ Average <dbl> 17.64800, 18.13821, 18.98347, 15.83471, 18.72650, 1…
#> $ Economy <dbl> 5.577750, 5.437043, 5.847263, 6.190630, 5.738106, 5…
#> $ StrikeRate <dbl> 18.98400, 20.01626, 19.47934, 15.34711, 19.58120, 2…
#> $ BestBowlingInnings <chr> "5/10", "5/21", "4/12", "5/15", "5/12", "4/15", "4/…
#> $ FourWickets <int> 4, 1, 4, 4, 0, 1, 1, 2, 1, 3, 2, 1, 3, 4, 1, 1, 4, …
#> $ FiveWickets <int> 3, 1, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
# Table showing certain features of the data
wt20 %>%
  select(Player, Country, Matches, Runs, Wickets, Economy, StrikeRate) %>%
  head() %>%
  knitr::kable(
    digits = 2, align = "c",
    caption = "Women Player career profile for international T20"
  )
Player | Country | Matches | Runs | Wickets | Economy | StrikeRate |
---|---|---|---|---|---|---|
A Mohammed | West Indies | 117 | 2206 | 125 | 5.58 | 18.98 |
Nida Dar | Pakistan | 128 | 2231 | 123 | 5.44 | 20.02 |
EA Perry | Australia | 136 | 2297 | 121 | 5.85 | 19.48 |
M Schutt | Australia | 93 | 1916 | 121 | 6.19 | 15.35 |
S Ismail | South Africa | 109 | 2191 | 117 | 5.74 | 19.58 |
KH Brunt | England | 109 | 2102 | 112 | 5.51 | 20.42 |
# Plotting Data
wt20 %>%
  filter(Wickets >= 50) %>%
  ggplot(aes(y = StrikeRate, x = Average)) +
  geom_point(alpha = 0.3, col = "blue") +
  ggtitle("Women International T20 Bowlers") +
  ylab("Balls bowled per wicket") +
  xlab("Runs conceded per wicket")
USA men’s ODI data by innings
# Fetch all USA Men's ODI data by innings
menODI <- fetch_cricinfo("ODI", "Men", "Batting",
  type = "innings",
  country = "United States of America"
)
# Table of USA players who have scored a century
menODI %>%
  filter(Runs >= 100) %>%
  select(Player, Runs, BallsFaced, Fours, Sixes, Opposition) %>%
  knitr::kable(digits = 2)
Player | Runs | BallsFaced | Fours | Sixes | Opposition |
---|---|---|---|---|---|
JS Malhotra | 173 | 124 | 4 | 16 | Papau New Guinea |
MD Patel | 130 | 101 | 11 | 6 | Oman |
Aaron Jones | 123 | 87 | 9 | 6 | Scotland |
SR Taylor | 114 | 123 | 11 | 3 | Nepal |
SJ Modani | 111 | 133 | 9 | 0 | Oman |
MD Patel | 100 | 114 | 9 | 1 | Nepal |
find_player_id()
Each player has a player ID on ESPNCricinfo, which is needed to access an individual player's data. Given a string containing a player's name, or part of the name, this function returns the names of the matching player(s), their Cricinfo ID(s), and some other information.
Argument
fetch_player_data()
Fetch individual player data from all matches played. The function scrapes the data from ESPNCricinfo and returns a tibble with one row per innings for all games a player has played. To identify a player, use their Cricinfo player ID. The simplest way to find this is to look up their Cricinfo profile page: the number at the end of the URL is the ID. For example, Meg Lanning's profile page is https://www.espncricinfo.com/cricketers/meg-lanning-329336, so her ID is 329336. Alternatively, use the find_player_id() function.
Argument
# Fetching Meg Lanning's ODI playing data
meg_lanning_id <- 329336 # Cricinfo ID taken from the profile URL above
MegLanning <- fetch_player_data(meg_lanning_id, "ODI") %>%
  mutate(NotOut = (Dismissal == "not out"))
dim(MegLanning)
#> [1] 103 14
names(MegLanning)
#> [1] "Date" "Innings" "Opposition" "Ground" "Runs"
#> [6] "Mins" "BF" "X4s" "X6s" "SR"
#> [11] "Pos" "Dismissal" "Inns" "NotOut"
# Compute batting average
MLave <- MegLanning %>%
  filter(!is.na(Runs)) %>%
  summarise(Average = sum(Runs) / (n() - sum(NotOut))) %>%
  pull(Average)
names(MLave) <- paste("Average =", round(MLave, 2))
# Plot ODI scores
ggplot(MegLanning) +
  geom_hline(aes(yintercept = MLave), col = "gray") +
  geom_point(aes(x = Date, y = Runs, col = NotOut)) +
  ggtitle("Meg Lanning ODI Scores") +
  scale_y_continuous(sec.axis = sec_axis(~., breaks = MLave))
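The batting-average calculation above divides total runs by the number of dismissals, so a not-out innings adds runs without adding to the denominator. A minimal base-R sketch on made-up innings shows the arithmetic; the `innings` data frame and the `batting_average()` helper are illustrative and not part of cricketdata.

```r
# Made-up innings, not real scores; NA marks a match where the player did not bat.
innings <- data.frame(
  Runs = c(50, 20, NA, 30),
  Dismissal = c("caught", "not out", NA, "bowled")
)

# Hypothetical helper: total runs divided by the number of dismissals.
batting_average <- function(df) {
  batted <- df[!is.na(df$Runs), ]           # drop matches without an innings
  not_out <- batted$Dismissal == "not out"  # innings that ended undismissed
  sum(batted$Runs) / (nrow(batted) - sum(not_out))
}

batting_average(innings)
```

Here three innings were batted and one of them was not out, so the average is 100 runs over 2 dismissals.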
fetch_cricsheet()
Cricsheet is the only openly accessible source for ball-by-ball cricket data. fetch_cricsheet() downloads CSV data from Cricsheet. The data must be specified by three factors: (a) the type of data: bbb (ball-by-ball), match or player; (b) gender; (c) competition. See https://cricsheet.org/downloads/ for what the competition character codes mean.
The raw T20 data from cricsheet is further processed to add more columns (features) to facilitate analysis.
Arguments
type: Character string giving type of data: ball-by-ball, match info or player info.
gender: Character string giving player gender: female or male.
competition: Character string giving name of competition, e.g. ipl for Indian Premier League, psl for Pakistan Super League, tests for international Test matches, etc.
Indian Premier League (IPL) Ball-by-Ball Data
# Fetch all men's IPL ball-by-ball data
ipl_bbb <- fetch_cricsheet("bbb", "male", "ipl")
ipl_bbb %>%
  glimpse()
#> Rows: 225,954
#> Columns: 33
#> $ match_id <int> 335982, 335982, 335982, 335982, 335982, 335982,…
#> $ season <chr> "2007/08", "2007/08", "2007/08", "2007/08", "20…
#> $ start_date <chr> "2008-04-18", "2008-04-18", "2008-04-18", "2008…
#> $ venue <chr> "M Chinnaswamy Stadium", "M Chinnaswamy Stadium…
#> $ innings <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ over <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3,…
#> $ ball <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 1, 2, 3,…
#> $ batting_team <chr> "Kolkata Knight Riders", "Kolkata Knight Riders…
#> $ bowling_team <chr> "Royal Challengers Bangalore", "Royal Challenge…
#> $ striker <chr> "SC Ganguly", "BB McCullum", "BB McCullum", "BB…
#> $ non_striker <chr> "BB McCullum", "SC Ganguly", "SC Ganguly", "SC …
#> $ bowler <chr> "P Kumar", "P Kumar", "P Kumar", "P Kumar", "P …
#> $ runs_off_bat <int> 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 6, 4, 0, 0, 0, 0,…
#> $ extras <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
#> $ ball_in_over <int> 1, 2, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3,…
#> $ extra_ball <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
#> $ balls_remaining <dbl> 119, 118, 118, 117, 116, 115, 114, 113, 112, 11…
#> $ runs_scored_yet <int> 1, 1, 2, 2, 2, 2, 3, 3, 7, 11, 17, 21, 21, 21, …
#> $ wicket <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ wickets_lost_yet <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ innings1_total <int> 222, 222, 222, 222, 222, 222, 222, 222, 222, 22…
#> $ innings2_total <int> 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82, 82,…
#> $ target <dbl> 223, 223, 223, 223, 223, 223, 223, 223, 223, 22…
#> $ wides <int> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ noballs <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ byes <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ legbyes <int> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, N…
#> $ penalty <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ wicket_type <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
#> $ player_dismissed <chr> "", "", "", "", "", "", "", "", "", "", "", "",…
#> $ other_wicket_type <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ other_player_dismissed <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ .groups <chr> "drop", "drop", "drop", "drop", "drop", "drop",…
# Top 20 batters wrt Boundary and Dot % in IPL 2022 season
ipl_bbb %>%
  filter(season == "2022") %>%
  group_by(striker) %>%
  summarize(
    Runs = sum(runs_off_bat), BallsFaced = n() - sum(!is.na(wides)),
    # StrikeRate here is runs per ball, not the conventional runs per 100 balls
    StrikeRate = Runs / BallsFaced,
    DotPercent = sum(runs_off_bat == 0) * 100 / BallsFaced,
    BoundaryPercent = sum(runs_off_bat %in% c(4, 6)) * 100 / BallsFaced
  ) %>%
  arrange(desc(Runs)) %>%
  rename(Batter = striker) %>%
  slice(1:20) %>%
  ggplot(aes(y = BoundaryPercent, x = DotPercent, size = BallsFaced)) +
  geom_point(color = "red", alpha = 0.3) +
  geom_text(aes(label = Batter),
    vjust = -0.5, hjust = 0.5, color = "#013369",
    position = position_dodge(0.9), size = 3
  ) +
  ylab("Boundary Percent") +
  xlab("Dot Percent") +
  ggtitle("IPL 2022: Top 20 Batters")
# Top 10 prolific batters in IPL 2022 season.
ipl_bbb %>%
  filter(season == "2022") %>%
  group_by(striker) %>%
  summarize(
    Runs = sum(runs_off_bat), BallsFaced = n() - sum(!is.na(wides)),
    StrikeRate = Runs / BallsFaced,
    DotPercent = sum(runs_off_bat == 0) * 100 / BallsFaced,
    BoundaryPercent = sum(runs_off_bat %in% c(4, 6)) * 100 / BallsFaced
  ) %>%
  arrange(desc(Runs)) %>%
  rename(Batter = striker) %>%
  slice(1:10) %>%
  knitr::kable(digits = 1, align = "c")
Batter | Runs | BallsFaced | StrikeRate | DotPercent | BoundaryPercent |
---|---|---|---|---|---|
JC Buttler | 863 | 579 | 1.5 | 42.7 | 22.3 |
KL Rahul | 616 | 455 | 1.4 | 36.9 | 16.5 |
Q de Kock | 508 | 341 | 1.5 | 36.4 | 20.5 |
HH Pandya | 487 | 371 | 1.3 | 36.4 | 16.4 |
Shubman Gill | 483 | 365 | 1.3 | 34.5 | 17.0 |
DA Miller | 481 | 337 | 1.4 | 31.5 | 16.3 |
F du Plessis | 468 | 367 | 1.3 | 42.2 | 16.9 |
S Dhawan | 460 | 375 | 1.2 | 42.9 | 15.7 |
SV Samson | 458 | 312 | 1.5 | 44.2 | 22.1 |
DJ Hooda | 451 | 330 | 1.4 | 34.5 | 16.4 |
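To make the per-batter metrics in the pipeline above easier to check by hand, here is a minimal base-R sketch on made-up deliveries. The `balls` data frame and the `summarise_batter()` helper are illustrative assumptions; only the column names `striker`, `runs_off_bat` and `wides` come from the Cricsheet output shown earlier. As in the pipeline, wides are removed from the balls faced, but a wide with no runs off the bat still counts towards the dot-ball numerator.

```r
# Toy ball-by-ball data: made-up deliveries, not real IPL records.
balls <- data.frame(
  striker      = c("A", "A", "A", "A", "A", "B", "B", "B"),
  runs_off_bat = c(4, 0, 0, 6, 1, 0, 2, 0),
  wides        = c(NA, NA, 1, NA, NA, NA, NA, NA)
)

# Hypothetical helper mirroring the summarize() step for one batter.
summarise_batter <- function(df) {
  balls_faced <- nrow(df) - sum(!is.na(df$wides))  # wides are not balls faced
  data.frame(
    Runs = sum(df$runs_off_bat),
    BallsFaced = balls_faced,
    DotPercent = 100 * sum(df$runs_off_bat == 0) / balls_faced,
    BoundaryPercent = 100 * sum(df$runs_off_bat %in% c(4, 6)) / balls_faced
  )
}

# Split by striker, summarise each batter, and stack the results;
# the row names of the result are the striker names.
batter_summary <- do.call(
  rbind,
  lapply(split(balls, balls$striker), summarise_batter)
)
batter_summary
```

Batter A receives five deliveries, one of which is a wide, so four balls are faced; two of those deliveries are boundaries, giving a boundary percentage of 50.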
player_meta
player_meta is a data set containing meta data on players and cricket officials, such as full name, country of representation, date of birth, bowling and batting hand, bowling style, and playing role. Data on more than 11,000 players and officials is available. This data was scraped from the ESPNCricinfo website.
player_meta %>%
  filter(!is.na(playing_role)) %>%
  select(-cricinfo_id, -unique_name) %>%
  head() %>%
  knitr::kable(
    digits = 1, align = "c", format = "pipe",
    col.names = c(
      "ID", "FullName", "Country", "DOB", "BirthPlace",
      "BattingStyle", "BowlingStyle", "PlayingRole"
    )
  )
ID | FullName | Country | DOB | BirthPlace | BattingStyle | BowlingStyle | PlayingRole |
---|---|---|---|---|---|---|---|
9dbc77b3 | Aaftab Alam Khan | Malta | 1986-01-31 | NA | Right hand Bat | Right arm Medium fast | Wicketkeeper Batter |
797f52cc | Aahan Gopinath Achar | Singapore | 1999-03-30 | NA | Left hand Bat | Slow Left arm Orthodox | Bowler |
e249fdaa | Aakash Chopra | India | 1977-09-19 | Agra, Uttar Pradesh | Right hand Bat | Right arm Medium, Right arm Offbreak | Batter |
4b0e3049 | Aaliyah Alicia Alleyne | West Indies | 1994-11-11 | NA | Right hand Bat | Right arm Medium | Bowler |
f1733e13 | Aaliyah Williams | West Indies | 1998-02-28 | NA | Right hand Bat | Right arm Medium | Allrounder |
a8e54ef4 | Aamer Jamal | Pakistan | 1996-07-05 | Mianwali | Right hand Bat | Right arm Medium | Allrounder |
fetch_player_meta()
Fetch a player's meta data, such as full name, country of representation, date of birth, bowling and batting hand, bowling style, and playing role. This meta data is useful for advanced modeling, e.g. age curves, batter profiles against bowling types, etc.
Argument
The Cricinfo player IDs can be obtained in multiple ways: use the find_player_id() function, get the ID from the player's Cricinfo page, or consult the player_meta data frame, which has meta data on more than 11,000 players.
# Download meta data on Meg Lanning and Ellyse Perry
aus_women <- fetch_player_meta(c(329336, 275487))
aus_women %>%
  knitr::kable(
    digits = 1, align = "c", format = "pipe",
    col.names = c(
      "ID", "FullName", "Country", "DOB", "BirthPlace", "BattingStyle",
      "BowlingStyle", "PlayingRole"
    )
  )
ID | FullName | Country | DOB | BirthPlace | BattingStyle | BowlingStyle | PlayingRole |
---|---|---|---|---|---|---|---|
329336 | Meghann Moira Lanning | Australia | 1992-03-25 | Singapore | Right hand Bat | Right arm Medium | Top order Batter |
275487 | Ellyse Alexandra Perry | Australia | 1990-11-03 | Wahroonga, Sydney, New South Wales | Right hand Bat | Right arm Fast medium | Allrounder |
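For the age-curve use case mentioned above, a date of birth has to be turned into an age on each match date. The following is a minimal base-R sketch, using the two dates of birth from the table and a made-up match date; `age_in_years()` is a hypothetical helper, not a cricketdata function.

```r
# Hypothetical helper: completed years between a date of birth and a given date.
age_in_years <- function(dob, on) {
  dob <- as.Date(dob)
  on <- as.Date(on)
  years <- as.integer(format(on, "%Y")) - as.integer(format(dob, "%Y"))
  # Subtract one year if the birthday has not yet occurred in the year of `on`.
  had_birthday <- format(on, "%m%d") >= format(dob, "%m%d")
  years - ifelse(had_birthday, 0L, 1L)
}

# Dates of birth as shown in the table; the match date is made up.
dob <- c(Lanning = "1992-03-25", Perry = "1990-11-03")
ages <- age_in_years(dob, "2022-03-25")
ages
```

Comparing month and day directly avoids the off-by-one errors that come from dividing a day count by 365.25.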
update_player_meta()
This function is designed to consult the directory of all players available on the Cricsheet website and add the meta data of new players to the player_meta data frame. The data for new players is scraped from ESPNCricinfo.