Once we have star databases built with the data available at the moment, periodically we may obtain additional data, with the same structure as the initial data but from a later time or from another place. Sometimes the new data also contains instances which are already included in the star database to operate on them.
Suppose we need to jointly analyse all the data available at the moment: We must include them in the star database. One possibility is to integrate all available data into a flat table and build the star database again. Another possibility is to use the incremental refresh mechanism described in this document.
This document shows by means of an example the possibilities offered by the package in this context. First, the starting data sets are presented. The next section shows how to generate the refresh structures. The following section shows how to perform the incremental refresh. Finally, we present how to make changes to the transformation functions and add these changes for future refresh operations. Finish with the conclusions.
The starting data set is the content in the variable
mrs_db
, obtained in the vignette titled Obtaining and
transforming flat tables,
vignette("v05-flat-table-op")
. It contains the
constellation, formed by two star databases. Next we get their
names.
The code to generate the constellation from the initial data is available in the vignette. Below is a graphic representation of the tables that make it up.
db_dm <- mrs_db |>
as_dm_class(pk_facts = FALSE)
db_dm |>
dm::dm_draw(rankdir = "LR", view_type = "all")
From the original data source (the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System dataset), suppose we obtain a set of data that we want to integrate with the previous data to analyse them together.
We have stored it in the package, in a file with the same format as
the original file which only contains a small portion of the original
data. We have made sure that there is data that was already included in
the data set considered to obtain the content of the mrs_db
constellation and also new data. It is accessed below.
file <-
system.file(
"extdata/mrs",
"mrs_122_us_cities_1962_2016_new.csv",
package = "rolap"
)
mrs_ft_new <-
read_flat_table_file(name = 'mrs new', file)
Using the read_flat_table_file()
function we read a
table stored in a text file and create a flat_table
object
with a name. Below are the first records of the table. We access the
table using the get_table()
function for the object of the
flat_table
class.
Year | WEEK | Week Ending Date | REGION | State | City | Pneumonia and Influenza Deaths | All Deaths | <1 year (all cause deaths) | 1-24 years (all cause deaths) | 25-44 years | 45-64 years (all cause deaths) | 65+ years (all cause deaths) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1962 | 48 | 12/01/1962 | 2 | NY | Buffalo | 5 | 148 | 10 | 1 | 11 | 31 | 95 |
1963 | 3 | 01/19/1963 | 4 | IA | Des Moines | 2 | 55 | 1 | 3 | 6 | 17 | 28 |
1963 | 6 | 02/09/1963 | 8 | CO | Pueblo | 0 | 21 | 0 | 0 | 1 | 4 | 16 |
1963 | 7 | 02/16/1963 | 7 | TX | El Paso | 0 | 39 | 10 | 2 | 3 | 9 | 15 |
1963 | 25 | 06/22/1963 | 1 | MA | Springfield | 5 | 35 | 0 | 0 | 1 | 9 | 25 |
1964 | 10 | 03/07/1964 | 1 | MA | Cambridge | 3 | 31 | 0 | 0 | 3 | 10 | 18 |
From the table that we have read, with the information contained in the constellation variable, we can automatically generate refresh structures for the star databases that make up the constellation.
From the data read in the form of a flat_table
object,
using the update_according_to()
function, we generate a
refresh structure of the star database of the mrs_db
constellation whose name is indicated.
The modification structure contains a star database generated from the read data. We can see its graphical representation below.
db_dm <- mrs_db_age_refresh |>
get_star_database() |>
as_dm_class(pk_facts = FALSE)
db_dm |>
dm::dm_draw(rankdir = "LR", view_type = "all")
In exactly the same way, we generate the refresh structure for the other star database.
Below is its graphical representation.
The transformation code to generate the variable mrs_db
can be obtained in the vignette titled Obtaining and transforming
flat tables, vignette("v05-flat-table-op")
.
It is also included in the refresh structure for each of the star
databases obtained. It can be obtained using the
get_transformation_code()
and
get_transformation_file()
functions, in the form of a
vector of strings and a file respectively.
mrs_db_age_refresh |>
get_transformation_code()
#> [1] "transform_instance_table <- function(instance_df, lookup_ft, definition_fun, star_sch) {"
#> [2] " ft <- "
#> [3] " flat_table("
#> [4] " name = 'mrs',"
#> [5] " instances = instance_df,"
#> [6] " unknown_value = 'Not available'"
#> [7] " ) |>"
#> [8] " transform_to_measure("
#> [9] " attributes = c('Pneumonia and Influenza Deaths', 'All Deaths', '<1 year (all cause deaths)', '1-24 years (all cause deaths)', '25-44 years', '45-64 years (all cause deaths)', '65+ years (all cause deaths)'),"
#> [10] " k_sep = NULL,"
#> [11] " decimal_sep = NULL"
#> [12] " ) |>"
#> [13] " transform_attribute_format("
#> [14] " attributes = 'WEEK',"
#> [15] " width = 2,"
#> [16] " decimal_places = 0,"
#> [17] " k_sep = ',',"
#> [18] " decimal_sep = '.',"
#> [19] " space_filling = TRUE"
#> [20] " ) |>"
#> [21] " replace_empty_values("
#> [22] " attributes = c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City'),"
#> [23] " empty_values = NULL"
#> [24] " ) |>"
#> [25] " add_custom_column("
#> [26] " name = 'city_state',"
#> [27] " definition = definition_fun"
#> [28] " ) |>"
#> [29] " replace_attribute_values("
#> [30] " attributes = c('City', 'city_state'),"
#> [31] " old = c('Wilimington', 'Wilimington DE'),"
#> [32] " new = c('Wilmington', 'Wilmington DE')"
#> [33] " ) |>"
#> [34] " join_lookup_table("
#> [35] " fk_attributes = 'city_state',"
#> [36] " lookup = lookup_ft"
#> [37] " ) |>"
#> [38] " select_attributes("
#> [39] " attributes = c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City', 'city_state', 'status', 'pop', 'lat', 'long')"
#> [40] " ) |>"
#> [41] " separate_measures("
#> [42] " measures = list(c('Pneumonia and Influenza Deaths', 'All Deaths'), c('<1 year (all cause deaths)', '1-24 years (all cause deaths)', '25-44 years', '45-64 years (all cause deaths)', '65+ years (all cause deaths)')),"
#> [43] " names = c('mrs_cause', 'mrs_age'),"
#> [44] " na_rm = TRUE"
#> [45] " ) |>"
#> [46] " magrittr::extract2('mrs_age') |>"
#> [47] " transform_to_values("
#> [48] " attribute = 'age',"
#> [49] " measure = 'all_deaths',"
#> [50] " id_reverse = NULL,"
#> [51] " na_rm = TRUE"
#> [52] " ) |>"
#> [53] " snake_case("
#> [54] " ) |>"
#> [55] " replace_string("
#> [56] " attributes = 'age',"
#> [57] " string = ' (all cause deaths)',"
#> [58] " replacement = NULL"
#> [59] " ) |>"
#> [60] " as_star_database("
#> [61] " schema = star_sch"
#> [62] " )"
#> [63] ""
#> [64] " ft"
#> [65] "}"
#> [66] ""
#> [67] "ft <- transform_instance_table(instance_df, lookup_ft, definition_fun, star_sch)"
If additional transformations are needed, it can be modified to our convenience. In the same way we consult it for the other star database, although we don’t show the result here.
We can see that the selection of the star database is done using the
magrittr::extract2()
function, which was not used in the
original vignette to select the elements of a list. It has been included
here to preserve the pipe syntax throughout the transformation.
This code only needs to be used if we want to perform additional transformations. If the previously defined transformations are sufficient, they have already been automatically applied to the data in order to integrate them.
Seeing the new instances that are going to be added to the dimensions
can help us determine whether or not we need to modify the
transformation code. We can do this using the
get_new_dimension_instances()
function, as shown below.
mrs_db_age_refresh |>
get_new_dimension_instances()
#> $when
#> # A tibble: 110 × 3
#> year week week_ending_date
#> <chr> <chr> <chr>
#> 1 1962 "48" 12/01/1962
#> 2 1963 " 3" 01/19/1963
#> 3 1963 " 7" 02/16/1963
#> 4 1964 "10" 03/07/1964
#> 5 1964 "12" 03/21/1964
#> 6 1964 "20" 05/16/1964
#> 7 1966 "26" 07/02/1966
#> 8 1966 "47" 11/26/1966
#> 9 1967 "10" 03/11/1967
#> 10 1967 "22" 06/03/1967
#> # ℹ 100 more rows
#>
#> $where
#> # A tibble: 2 × 8
#> region state city city_state status pop lat long
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 MA Boston Boston MA state capital " 567,759" 42.3 " -71.0"
#> 2 5 MD Baltimore Baltimore MD non-capital " 602,658" 39.3 " -76.6"
If necessary, starting from the code that we have obtained, we can
perform transformations on the starting flat_table
structure or on the star_database
object obtained.
We can see that there are two new cities that were not included in the where dimension of the initial star databases: Baltimore and Boston.
The most common thing is that refresh operations only include new instances in fact tables, but it may be the case that repeated instances appear: They may have different values in the measures, but the same values in the dimension foreign keys.
To perform the incremental refresh operation, we must determine what
happens to the fact table instances that were already included in the
original star database. We can query existing fact instances using the
get_existing_fact_instances()
function, as can be seen
below for each of the star databases.
mrs_db_age_refresh |>
get_existing_fact_instances()
#> # A tibble: 200 × 14
#> year week week_ending_date region state city city_state status pop lat
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1963 "25" 06/22/1963 1 MA Spri… Springfie… non-c… " 1… 42.1
#> 2 1963 "25" 06/22/1963 1 MA Spri… Springfie… non-c… " 1… 42.1
#> 3 1963 "25" 06/22/1963 1 MA Spri… Springfie… non-c… " 1… 42.1
#> 4 1963 "25" 06/22/1963 1 MA Spri… Springfie… non-c… " 1… 42.1
#> 5 1963 "25" 06/22/1963 1 MA Spri… Springfie… non-c… " 1… 42.1
#> 6 1965 " 5" 02/06/1965 4 KS Wich… Wichita KS non-c… " 3… 37.7
#> 7 1965 " 5" 02/06/1965 4 KS Wich… Wichita KS non-c… " 3… 37.7
#> 8 1965 " 5" 02/06/1965 4 KS Wich… Wichita KS non-c… " 3… 37.7
#> 9 1965 " 5" 02/06/1965 4 KS Wich… Wichita KS non-c… " 3… 37.7
#> 10 1965 " 5" 02/06/1965 4 KS Wich… Wichita KS non-c… " 3… 37.7
#> # ℹ 190 more rows
#> # ℹ 4 more variables: long <chr>, age <chr>, all_deaths <int>, nrow_agg <int>
mrs_db_cause_refresh |>
get_existing_fact_instances()
#> # A tibble: 40 × 14
#> year week week_ending_date region state city city_state status pop lat
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1963 "25" 06/22/1963 1 MA Spri… Springfie… non-c… " 1… 42.1
#> 2 1965 " 5" 02/06/1965 4 KS Wich… Wichita KS non-c… " 3… 37.7
#> 3 1966 "35" 09/03/1966 9 CA Long… Long Beac… non-c… " 4… 33.8
#> 4 1966 "41" 10/15/1966 9 CA San … San Jose … non-c… " 8… 37.3
#> 5 1968 "29" 07/20/1968 3 OH Youn… Youngstow… non-c… " … 41.1
#> 6 1971 "28" 07/17/1971 3 IL Chic… Chicago IL non-c… "2,8… 41.8
#> 7 1973 " 6" 02/10/1973 2 NY Syra… Syracuse … non-c… " 1… 43.0
#> 8 1974 "49" 12/07/1974 5 FL Tampa Tampa FL non-c… " 3… 28.0
#> 9 1976 "48" 12/04/1976 9 CA Fres… Fresno CA non-c… " 4… 36.8
#> 10 1977 "21" 05/28/1977 6 TN Memp… Memphis TN non-c… " 6… 35.1
#> # ℹ 30 more rows
#> # ℹ 4 more variables: long <chr>, pneumonia_and_influenza_deaths <int>,
#> # all_deaths <int>, nrow_agg <int>
In each case it may be interesting to perform a different operation
on these previously existing instances. By default, what is done is to
ignore them but we can indicate that they are grouped, replaced or
deleted, using the existing_instances
parameter.
To better appreciate the update made, we are going to perform the same query before and after the incremental refresh operation: We obtain the design tables, along with the number of instances of each one. For the where dimension table, we get the names of the first cities to check that Baltimore and Boston were not included.
l_db <- mrs_db |>
as_tibble_list()
names <- names(l_db)
for (i in seq_along(l_db)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
#> name: when, 1966 rows
#> name: where, 120 rows
#> name: who, 5 rows
#> name: mrs_cause, 3342 rows
#> name: mrs_age, 16565 rows
head(sort(l_db[['where']]$city), 15)
#> [1] "Akron" "Albany" "Albuquerque" "Allentown" "Atlanta"
#> [6] "Austin" "Baton Rouge" "Berkeley" "Birmingham" "Boise"
#> [11] "Bridgeport" "Buffalo" "Cambridge" "Camden" "Canton"
Next, we perform the incremental refresh for each of the star
databases in the original constellation. For one of the star databases,
through the existing_instances
parameter, it has been
indicated that the instances of previously existing facts are grouped
with the new ones. For the other design, the default option is
considered: Ignore fact table previously existing instances.
We will make a copy of the constellation to perform an operation later.
mrs_db_seg <- mrs_db
mrs_db <- mrs_db |>
incremental_refresh(mrs_db_age_refresh) |>
incremental_refresh(mrs_db_cause_refresh, existing_instances = "group")
Finally, we consult the same data as before again.
l_db <- mrs_db |>
as_tibble_list()
names <- names(l_db)
for (i in seq_along(l_db)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
#> name: when, 2076 rows
#> name: where, 122 rows
#> name: who, 5 rows
#> name: mrs_cause, 3677 rows
#> name: mrs_age, 18228 rows
head(sort(l_db[['where']]$city), 15)
#> [1] "Akron" "Albany" "Albuquerque" "Allentown" "Atlanta"
#> [6] "Austin" "Baltimore" "Baton Rouge" "Berkeley" "Birmingham"
#> [11] "Boise" "Boston" "Bridgeport" "Buffalo" "Cambridge"
We observe how the number of instances has increased in the facts and in the dimensions and the where dimension already contains the names of the cities that appeared as new in the update operation.
In this case no problem has been detected in the new dimension data. But let’s assume that we wanted to make some change to the modify operations before doing the update to show how it would be done.
First of all, we obtain the transformation function. In this case we
have stored it in a temporary file using the
get_transformation_file()
function and we have copied it
here.
Starting from the tibble
of the initial
flat_table
, the star_schema
to define the
star_database
, the lookup tables and field definition
functions that we have used, we can reproduce all the transformations
and adapt them to the new situations that hypothetically have
arisen.
transform_instance_table <-
function(instance_df,
lookup_ft,
definition_fun,
star_sch) {
ft <-
flat_table(name = 'mrs',
instances = instance_df,
unknown_value = 'Not available') |>
transform_to_measure(
attributes = c(
'Pneumonia and Influenza Deaths',
'All Deaths',
'<1 year (all cause deaths)',
'1-24 years (all cause deaths)',
'25-44 years',
'45-64 years (all cause deaths)',
'65+ years (all cause deaths)'
),
k_sep = NULL,
decimal_sep = NULL
) |>
transform_attribute_format(
attributes = 'WEEK',
width = 2,
decimal_places = 0,
k_sep = ',',
decimal_sep = '.'
) |>
replace_empty_values(
attributes = c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City'),
empty_values = NULL
) |>
add_custom_column(name = 'city_state',
definition = definition_fun) |>
replace_attribute_values(
attributes = c('City', 'city_state'),
old = c('Wilimington', 'Wilimington DE'),
new = c('Wilmington', 'Wilmington DE')
) |>
join_lookup_table(fk_attributes = 'city_state',
lookup = lookup_ft) |>
select_attributes(
attributes = c(
'Year',
'WEEK',
'Week Ending Date',
'REGION',
'State',
'City',
'city_state',
'status',
'pop',
'lat',
'long'
)
) |>
separate_measures(
measures = list(
c('Pneumonia and Influenza Deaths', 'All Deaths'),
c(
'<1 year (all cause deaths)',
'1-24 years (all cause deaths)',
'25-44 years',
'45-64 years (all cause deaths)',
'65+ years (all cause deaths)'
)
),
names = c('mrs_cause', 'mrs_age'),
na_rm = TRUE
) |>
magrittr::extract2('mrs_cause') |>
snake_case() |>
as_star_database(schema = star_sch)
ft
}
We can get these parameters from the initial definition (in the
vignette titled Obtaining and transforming flat tables,
vignette("v05-flat-table-op")
) or by consulting the refresh
structure with the functions available for this purpose.
instance_df <- mrs_ft_new |>
get_table()
lookup_list <- mrs_db_cause_refresh |>
get_lookup_tables()
star_sch <- mrs_db_cause_refresh |>
get_star_schema()
# function to define a derived column
city_state <- function(table) {
paste0(table$City, ' ', table$State)
}
mrs_db_cause_transf <-
transform_instance_table(
instance_df = instance_df,
lookup_ft = lookup_list[['us_cities']],
definition_fun = city_state,
star_sch = star_sch
)
We perform the same operation for the other star database.
transform_instance_table_2 <-
function(instance_df,
lookup_ft,
definition_fun,
star_sch) {
ft <-
flat_table(name = 'mrs',
instances = instance_df,
unknown_value = 'Not available') |>
transform_to_measure(
attributes = c(
'Pneumonia and Influenza Deaths',
'All Deaths',
'<1 year (all cause deaths)',
'1-24 years (all cause deaths)',
'25-44 years',
'45-64 years (all cause deaths)',
'65+ years (all cause deaths)'
),
k_sep = NULL,
decimal_sep = NULL
) |>
transform_attribute_format(
attributes = 'WEEK',
width = 2,
decimal_places = 0,
k_sep = ',',
decimal_sep = '.'
) |>
replace_empty_values(
attributes = c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City'),
empty_values = NULL
) |>
add_custom_column(name = 'city_state',
definition = definition_fun) |>
replace_attribute_values(
attributes = c('City', 'city_state'),
old = c('Wilimington', 'Wilimington DE'),
new = c('Wilmington', 'Wilmington DE')
) |>
join_lookup_table(fk_attributes = 'city_state',
lookup = lookup_ft) |>
select_attributes(
attributes = c(
'Year',
'WEEK',
'Week Ending Date',
'REGION',
'State',
'City',
'city_state',
'status',
'pop',
'lat',
'long'
)
) |>
separate_measures(
measures = list(
c('Pneumonia and Influenza Deaths', 'All Deaths'),
c(
'<1 year (all cause deaths)',
'1-24 years (all cause deaths)',
'25-44 years',
'45-64 years (all cause deaths)',
'65+ years (all cause deaths)'
)
),
names = c('mrs_cause', 'mrs_age'),
na_rm = TRUE
) |>
magrittr::extract2('mrs_age') |>
transform_to_values(
attribute = 'age',
measure = 'all_deaths',
id_reverse = NULL,
na_rm = TRUE
) |>
snake_case() |>
replace_string(attributes = 'age',
string = ' (all cause deaths)',
replacement = NULL) |>
as_star_database(schema = star_sch)
ft
}
star_sch <- mrs_db_age_refresh |>
get_star_schema()
mrs_db_age_transf <-
transform_instance_table_2(
instance_df = instance_df,
lookup_ft = lookup_list[['us_cities']],
definition_fun = city_state,
star_sch = star_sch
)
The result of these functions, in each case, is a
star_database
containing the new transformations.
Now we have to create the refresh structure for each star database,
considering the original constellation (we have made a copy before
updating it in mrs_db_seg
). To create the refresh
structures, we now indicate that the operations are considered from the
star database of the sdb_operations
parameter.
mrs_db_cause_transf_refresh <- mrs_ft_new |>
update_according_to(mrs_db_seg, star = "mrs_cause", sdb_operations = mrs_db_cause_transf)
mrs_db_age_transf_refresh <- mrs_ft_new |>
update_according_to(mrs_db_seg, star = "mrs_age", sdb_operations = mrs_db_age_transf)
That is, with the sdb_operations
parameter, it takes the
data to be refreshed from the constellation, but the modification
operations are taken from this star database (where we have defined the
new operations that have been considered necessary).
To show the result of the incremental refresh operations, we are going to repeat the process carried out previously: We show the name and number of instances of the tables before and after the transformation, as well as the names of the first cities of the where dimension.
l_db <- mrs_db_seg |>
as_tibble_list()
names <- names(l_db)
for (i in seq_along(l_db)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
#> name: when, 1966 rows
#> name: where, 120 rows
#> name: who, 5 rows
#> name: mrs_cause, 3342 rows
#> name: mrs_age, 16565 rows
head(sort(l_db[['where']]$city), 15)
#> [1] "Akron" "Albany" "Albuquerque" "Allentown" "Atlanta"
#> [6] "Austin" "Baton Rouge" "Berkeley" "Birmingham" "Boise"
#> [11] "Bridgeport" "Buffalo" "Cambridge" "Camden" "Canton"
To perform the incremental refresh, as the refresh structure has new
transformation operations, we can indicate that these become the
transformation operations of the constellation. Thus, in future updates,
they will be taken into account automatically. This is indicated by the
boolean parameter replace_transformations
.
mrs_db_seg <- mrs_db_seg |>
incremental_refresh(mrs_db_age_transf_refresh, replace_transformations = TRUE) |>
incremental_refresh(
mrs_db_cause_transf_refresh,
existing_instances = "group",
replace_transformations = TRUE
)
Finally, we consult the same data as before again.
l_db <- mrs_db_seg |>
as_tibble_list()
names <- names(l_db)
for (i in seq_along(l_db)){
cat(sprintf("name: %s, %d rows\n", names[i], nrow(l_db[[i]])))
}
#> name: when, 2076 rows
#> name: where, 122 rows
#> name: who, 5 rows
#> name: mrs_cause, 3677 rows
#> name: mrs_age, 18228 rows
head(sort(l_db[['where']]$city), 15)
#> [1] "Akron" "Albany" "Albuquerque" "Allentown" "Atlanta"
#> [6] "Austin" "Baltimore" "Baton Rouge" "Berkeley" "Birmingham"
#> [11] "Boise" "Boston" "Bridgeport" "Buffalo" "Cambridge"
It can be seen that we have obtained the same results as before.
This document shows the functions supporting incremental refreshing
of star databases offered by the rolap
package.
Starting from a flat table, the transformation operations through which a star database has been defined are automatically applied. These operations can be obtained and enriched to adapt to new situations that arise. Once the data has been conveniently updated, it can be integrated with the original star database indicating the operation that is considered most appropriate to integrate the data previously existing in the fact table.
This incremental refresh mechanism, although not trivial, facilitates this generally difficult process. It lays the foundations to continue simplifying it and offering new functionalities.