You are charged for the number of bytes scanned by Amazon Athena, rounded up to the nearest megabyte, with a 10MB minimum per query. There are no charges for Data Definition Language (DDL) statements like CREATE/ALTER/DROP TABLE, statements for managing partitions, or failed queries. Cancelled queries are charged based on the amount of data scanned.
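To put the billing model into numbers, here is a rough sketch of the arithmetic in R. The $5 per TB rate is an assumption (it is the common rate, but check the pricing page for your region):

# rough cost of a single query, assuming $5 per TB scanned
athena_query_cost <- function(scanned_mb, price_per_tb = 5) {
  billed_mb <- max(10, ceiling(scanned_mb))  # rounded up per MB, 10MB minimum
  billed_mb / 1e6 * price_per_tb             # approximating 1 TB as 1e6 MB
}

athena_query_cost(0.5)    # a tiny query is still billed at the 10MB minimum
athena_query_cost(5000)   # ~5GB scanned is roughly $0.025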
Compressing your data allows Athena to scan less data. Converting your data to columnar formats allows Athena to selectively read only the columns required to process the data. Athena supports Apache ORC and Apache Parquet. Partitioning your data also allows Athena to restrict the amount of data scanned, which leads to cost savings and improved performance. You can see the amount of data scanned per query on the Athena console.
This makes it all the more important to compress your data and convert it to the recommended file formats, Apache Parquet or Apache ORC.
DON’T WORRY!!! RAthena is here to help!
RAthena’s help

For a lot of users, Apache Parquet and Apache ORC are file formats that aren’t well known, and as a result a lot of systems don’t have the software to create these formats. RAthena offers some assistance, firstly by enabling the Apache Parquet format to be uploaded through dbWriteTable, using the R package arrow to create the parquet files.
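For instance, a minimal sketch of a Parquet upload (the table name is illustrative; file.type tells dbWriteTable which format to write, and the arrow package must be installed):

library(DBI)
library(RAthena)

con <- dbConnect(athena())

# upload iris as Parquet instead of the default delimited format
dbWriteTable(con, "iris_parquet_upload", iris, file.type = "parquet")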
If uploading Apache Parquet is not possible, or if the Apache ORC file format is preferred, then RAthena offers another solution: it can utilise the power of AWS Athena to convert file formats for you. What this allows you to do is upload data in a simple delimited format and have Athena handle the conversion afterwards.

Uploading data in delimited format is the easiest method:
library(DBI)
library(RAthena)

con <- dbConnect(athena())

# create a temporary database to upload data into
res <- dbExecute(con, "CREATE DATABASE IF NOT EXISTS temp")
dbClearResult(res)

iris2 <- iris
iris2$time_stamp <- format(Sys.Date(), "%Y%m%d")

dbWriteTable(con, "temp.iris_delim", iris2)
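A quick sanity check with standard DBI calls confirms the upload worked:

# peek at the newly created table
dbGetQuery(con, "SELECT * FROM temp.iris_delim LIMIT 5")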
However, the delimited file format isn’t the most cost-effective when it comes to using AWS Athena. To overcome this, we can convert it using AWS Athena itself.
Converting the table to a non-partitioned Parquet or ORC format:
# convert to parquet
dbConvertTable(con,
               obj = "temp.iris_delim",
               name = "iris_parquet",
               file.type = "parquet")

# convert to orc
dbConvertTable(con,
               obj = "temp.iris_delim",
               name = "iris_orc",
               file.type = "orc")
NOTE: By default dbConvertTable compresses Parquet/ORC formats using snappy compression.
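The converted tables are queryable straight away; for example, an illustrative row count:

# confirm the Parquet copy contains the full data set
dbGetQuery(con, "SELECT count(*) AS n FROM iris_parquet")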
RAthena goes a step further by allowing tables to be converted with partitions.
# convert to parquet with partition time_stamp
dbConvertTable(con,
               obj = "temp.iris_delim",
               name = "iris_parquet_partition",
               partition = "time_stamp",
               file.type = "parquet")
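To check that the partition was registered, you can list the table's partitions (SHOW PARTITIONS is standard Athena SQL):

# list registered partitions of the converted table
dbGetQuery(con, "SHOW PARTITIONS iris_parquet_partition")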
RAthena even allows SQL queries to be converted into the desired file format:
dbConvertTable(con,
               obj = SQL("select
                            Sepal_Length,
                            Sepal_Width,
                            date_format(current_date, '%Y%m%d') as time_stamp
                          from temp.iris_delim"),
               name = "iris_orc_partition",
               partition = "time_stamp",
               file.type = "orc")
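Behind the scenes this style of conversion maps onto Athena's CREATE TABLE AS SELECT (CTAS). As a sketch, a hand-written equivalent would look something like the following (the table name is hypothetical, and dbConvertTable also manages the S3 location for you):

# illustrative CTAS statement; partition columns must come last in the select
res <- dbExecute(con, "
  CREATE TABLE iris_orc_partition_ctas
  WITH (format = 'ORC', partitioned_by = ARRAY['time_stamp'])
  AS
  SELECT Sepal_Length,
         Sepal_Width,
         date_format(current_date, '%Y%m%d') AS time_stamp
  FROM temp.iris_delim")
dbClearResult(res)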
As we have created partitioned data, we can easily insert into it:
res <- dbExecute(con, "insert into iris_orc_partition
                       select
                         Sepal_Length,
                         Sepal_Width,
                         date_format(date_add('day', 1, current_date), '%Y%m%d') time_stamp
                       from temp.iris_delim")
dbClearResult(res)
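A quick check that both days landed in separate partitions:

# row counts per partition (illustrative)
dbGetQuery(con, "SELECT time_stamp, count(*) AS n
                 FROM iris_orc_partition
                 GROUP BY time_stamp")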
What this all means is that you can create ETL processes by uploading data in a basic file format (delimited), and then converting / inserting it into the preferred file format.
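As a rough sketch of such an ETL step (the function name and the overwrite behaviour are assumptions for illustration):

# hypothetical daily load: land delimited data, then append a new
# partition to the ORC table created above
daily_load <- function(con, df) {
  df$time_stamp <- format(Sys.Date(), "%Y%m%d")
  dbWriteTable(con, "temp.iris_delim", df, overwrite = TRUE)
  res <- dbExecute(con, "insert into iris_orc_partition
                         select Sepal_Length, Sepal_Width, time_stamp
                         from temp.iris_delim")
  dbClearResult(res)
}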
dplyr method

The good news doesn’t stop there: RAthena integrates with dplyr, allowing the conversion to be done through dplyr as well.
library(dplyr)

iris_tbl <- tbl(con, dbplyr::in_schema("temp", "iris_delim"))

r_date <- format(Sys.Date(), "%Y%m%d")

iris_tbl %>%
  select(petal_length,
         petal_width) %>%
  mutate(time_stamp = r_date) %>%
  compute("iris_dplyr_parquet", partition = "time_stamp", file_type = "parquet")
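The resulting Parquet-backed table behaves like any other Athena table, so it can be read straight back through dplyr:

# read the converted table back into R
tbl(con, "iris_dplyr_parquet") %>% collect()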