There are several functions in base R to pivot data. These include
the t()
and reshape()
functions. There are
also well-designed functions in the tidyverse for
transposing data. Nevertheless, some users will prefer the syntax of
proc_transpose()
. This function provides control over
output column naming, and an intuitive set of parameters.
To explore proc_transpose
, let’s first create some
sample data:
# Create input data
dat <- read.table(header = TRUE, text = '
Name Subject Semester1 Semester2
Samma Maths 96 94
Sandy English 76 51
Devesh German 76 95
Rakesh Maths 50 63
Priya English 62 80
Kranti Maths 92 92
William German 87 75')
# View data
dat
# Name Subject Semester1 Semester2
# 1 Samma Maths 96 94
# 2 Sandy English 76 51
# 3 Devesh German 76 95
# 4 Rakesh Maths 50 63
# 5 Priya English 62 80
# 6 Kranti Maths 92 92
# 7 William German 87 75
The proc_tranpose()
function may be executed without any
parameters. The default usage will tranpose all numeric variables and
construct generic column names for the new columns:
# No parameters
res <- proc_transpose(dat)
# View result
res
# NAME COL1 COL2 COL3 COL4 COL5 COL6 COL7
# 1 Semester1 96 76 76 50 62 92 87
# 2 Semester2 94 51 95 63 80 92 75
If you didn’t want all numeric variables transposed, you could
specify which variable(s) to transpose using the var
parameter. Multiple variables can be passed as a vector:
You may control the output column names using the
prefix
, suffix
, and name
parameters. The prefix
and suffix
will be used
to construct the new transposed column names. The name
parameter value will be used as the name of the generic “NAME” column,
which identify the var
parameter values:
# With prefix
res <- proc_transpose(dat, name = VarName, prefix = Student)
# View result
res
VarName Student1 Student2 Student3 Student4 Student5 Student6 Student7
1 Semester1 96 76 76 50 62 92 87
2 Semester2 94 51 95 63 80 92 75
Here is the same function call with a suffix
:
# With suffix
res <- proc_transpose(dat, name = VarName, prefix = S, suffix = Score)
# View result
res
VarName S1Score S2Score S3Score S4Score S5Score S6Score S7Score
1 Semester1 96 76 76 50 62 92 87
2 Semester2 94 51 95 63 80 92 75
Note that since a variable name in R cannot start with a number, some sort of prefix is required.
If your data contains a column with appropriate labels for the
transposed columns, you can assign it to the id
parameter.
The id
values will then be used for the new column
names.
# Assign column names from data
res <- proc_transpose(dat, name = VarName, id = Name)
# View result
res
VarName Samma Sandy Devesh Rakesh Priya Kranti William
1 Semester1 96 76 76 50 62 92 87
2 Semester2 94 51 95 63 80 92 75
Using two id
parameters tells the function that you want
columns that are combinations of the two variables:
# Two id variables
res <- proc_transpose(dat, id = v(Name, Subject))
res
# NAME Samma.Maths Sandy.English Devesh.German Rakesh.Maths Priya.English Kranti.Maths William.German
# 1 Semester1 96 76 76 50 62 92 87
# 2 Semester2 94 51 95 63 80 92 75
The default delimiter shown above is a dot (“.”). The delimiter may
be changed with the delimiter
parameter:
The by
parameter tells the function to group by the
by
variable before transposing. As you can see below, the
by
varible is then retained on the output dataset so you
can identify which rows belong to which group.
# By variable
res <- proc_transpose(dat, by = Name, id = Subject, name = Semester)
# View result
res
# Name Semester German English Maths
# 1 Devesh Semester1 76 NA NA
# 2 Devesh Semester2 95 NA NA
# 3 Kranti Semester1 NA NA 92
# 4 Kranti Semester2 NA NA 92
# 5 Priya Semester1 NA 62 NA
# 6 Priya Semester2 NA 80 NA
# 7 Rakesh Semester1 NA NA 50
# 8 Rakesh Semester2 NA NA 63
# 9 Samma Semester1 NA NA 96
# 10 Samma Semester2 NA NA 94
# 11 Sandy Semester1 NA 76 NA
# 12 Sandy Semester2 NA 51 NA
# 13 William Semester1 87 NA NA
# 14 William Semester2 75 NA NA
The by
parameter is a valuable feature of
proc_transpose()
. The by variable can have a significant
effect on the shape of the output data. Let’s see what happens when we
use a different by variable.
# By variable
res <- proc_transpose(dat, by = Subject, id = Name)
# Subject NAME Sandy Priya Devesh William Samma Rakesh Kranti
# 1 English Semester1 76 62 NA NA NA NA NA
# 2 English Semester2 51 80 NA NA NA NA NA
# 3 German Semester1 NA NA 76 87 NA NA NA
# 4 German Semester2 NA NA 95 75 NA NA NA
# 5 Maths Semester1 NA NA NA NA 96 50 92
# 6 Maths Semester2 NA NA NA NA 94 63 92
Now let’s use two by variables:
# Two by variables
res <- proc_transpose(dat, by = v(Name, Subject))
# View results
res
# Name Subject NAME COL1
# 1 Priya English Semester1 62
# 2 Priya English Semester2 80
# 3 Sandy English Semester1 76
# 4 Sandy English Semester2 51
# 5 Devesh German Semester1 76
# 6 Devesh German Semester2 95
# 7 William German Semester1 87
# 8 William German Semester2 75
# 9 Kranti Maths Semester1 92
# 10 Kranti Maths Semester2 92
# 11 Rakesh Maths Semester1 50
# 12 Rakesh Maths Semester2 63
# 13 Samma Maths Semester1 96
# 14 Samma Maths Semester2 94
By transposing one more time on the results of the previous example, you can nearly restore the original data frame.
# Restore original data shape
res2 <- proc_transpose(res, by = v(Name, Subject), id = NAME)
# View results
res2[ , c("Name", "Subject", "Semester1", "Semester2")]
# Name Subject Semester1 Semester2
# 1 Priya English 62 80
# 2 Sandy English 76 51
# 3 Devesh German 76 95
# 4 William German 87 75
# 5 Kranti Maths 92 92
# 6 Rakesh Maths 50 63
# 7 Samma Maths 96 94
The proc_sort()
function in the procs
package attempts to recreate the most basic functionality of PROC SORT.
The function accepts an input data frame and returns the sorted result.
Before examining the function, first let’s create some sample data:
# Create sample data
dat <- read.table(header = TRUE, text = '
ID Name Score
1 David 74
2 Sam 45
3 Bane 87
3 Mary 92
4 Dane 23
5 Jenny 87
6 Simran 63
8 Priya 72
1 David 45
2 Ram 54
3 Bane 87
5 Ken 87')
# View data
dat
# ID Name Score
# 1 1 David 74
# 2 2 Sam 45
# 3 3 Bane 87
# 4 3 Mary 92
# 5 4 Dane 23
# 6 5 Jenny 87
# 7 6 Simran 63
# 8 8 Priya 72
# 9 1 David 45
# 10 2 Ram 54
# 11 3 Bane 87
# 12 5 Ken 87
Like proc_transpose()
, the function has a default usage,
which is to sort by all variables:
# Default sort
res <- proc_sort(dat)
# View results
res
# ID Name Score
# 9 1 David 45
# 1 1 David 74
# 10 2 Ram 54
# 2 2 Sam 45
# 3 3 Bane 87
# 11 3 Bane 87
# 4 3 Mary 92
# 5 4 Dane 23
# 6 5 Jenny 87
# 12 5 Ken 87
# 7 6 Simran 63
# 8 8 Priya 72
To sort by a specific variable, you can use the by
parameter:
# Sort By
res <- proc_sort(dat, by = Score)
# View results
res
# ID Name Score
# 5 4 Dane 23
# 2 2 Sam 45
# 9 1 David 45
# 10 2 Ram 54
# 7 6 Simran 63
# 8 8 Priya 72
# 1 1 David 74
# 3 3 Bane 87
# 6 5 Jenny 87
# 11 3 Bane 87
# 12 5 Ken 87
# 4 3 Mary 92
You can also sort by two variables:
# Sort By
res <- proc_sort(dat, by = v(Score, Name))
# View results
res
# ID Name Score
# 5 4 Dane 23
# 9 1 David 45
# 2 2 Sam 45
# 10 2 Ram 54
# 7 6 Simran 63
# 8 8 Priya 72
# 1 1 David 74
# 3 3 Bane 87
# 11 3 Bane 87
# 6 5 Jenny 87
# 12 5 Ken 87
# 4 3 Mary 92
Notice that when the scores are tied, the data is now sorted alphabetically by name.
To put the highest scores on top, you can sort descending using the
order
parameter.
# Sort By
res <- proc_sort(dat, by = Score, order = descending)
# View results
res
# ID Name Score
# 4 3 Mary 92
# 3 3 Bane 87
# 6 5 Jenny 87
# 11 3 Bane 87
# 12 5 Ken 87
# 1 1 David 74
# 8 8 Priya 72
# 7 6 Simran 63
# 10 2 Ram 54
# 2 2 Sam 45
# 9 1 David 45
# 5 4 Dane 23
The order
instructions can be abbreviated “a” for
ascending and “d” for descending. These abbreviations are convenient
when there is more than one by
variable.
The keep
parameter allows you to select only some of the
variables for the output dataset:
Another convenient feature of proc_sort()
is the
nodupkey
option. This option will eliminate duplicates
based on the unique combination of values of the by
variables. For instance, to get a unique list of students, you could use
keep
and options = nodupkey
:
# Keep and Nodupkey
res <- proc_sort(dat, by = Name, keep = Name, options = nodupkey)
# View results
res
# Name
# 3 Bane
# 5 Dane
# 1 David
# 6 Jenny
# 12 Ken
# 4 Mary
# 8 Priya
# 10 Ram
# 2 Sam
# 7 Simran
There is also a dupkey
option to retain only records
with duplicate key values. This feature would allow you to easily find
students who took the exam twice, and show the scores for each
attempt.
Most of the time, you will sort alphabetically or numerically. But sometimes you may have character data that you want to sort in a specific order that is not alphabetic. Let’s return to the class data we used above:
# Create input data
dat <- read.table(header = TRUE, text = '
Name Subject Semester1 Semester2
Samma Maths 96 94
Sandy English 76 51
Devesh German 76 95
Rakesh Maths 50 63
Priya English 62 80
Kranti Maths 92 92
William German 87 75')
# View data
dat
# Name Subject Semester1 Semester2
# 1 Samma Maths 96 94
# 2 Sandy English 76 51
# 3 Devesh German 76 95
# 4 Rakesh Maths 50 63
# 5 Priya English 62 80
# 6 Kranti Maths 92 92
# 7 William German 87 75
What if we wanted to sort “Maths” to the top, followed by “English” and “German”? How could that be accomplished?
To perform a sort in a specific, non-alphabetic order, first create a factor on your desired sort column and order the levels by the intended sort. For instance:
# Create factor with ordered levels
dat$Subject <- factor(dat$Subject, levels = c("Maths", "English", "German"))
# View attributes
attributes(dat$Subject)
# $levels
# [1] "Maths" "English" "German"
#
# $class
# [1] "factor"
Then use proc_sort()
to sort by the factor variable:
# Sort by factor variable
dat2 <- proc_sort(dat, by = Subject)
# View result
dat2
# Name Subject Semester1 Semester2
# 1 Samma Maths 96 94
# 4 Rakesh Maths 50 63
# 6 Kranti Maths 92 92
# 2 Sandy English 76 51
# 5 Priya English 62 80
# 3 Devesh German 76 95
# 7 William German 87 75
Note that after the sort operation, the “Subject” variable is still a factor:
If desired, we can turn the variable back into a character using the “as.character” parameter. Following the sort, this parameter will automatically cast any factors on the by parameter to character variables:
# Sort by factor variable
dat2 <- proc_sort(dat, by = Subject, as.character = TRUE)
# Same results
dat2
# Name Subject Semester1 Semester2
# 1 Samma Maths 96 94
# 4 Rakesh Maths 50 63
# 6 Kranti Maths 92 92
# 2 Sandy English 76 51
# 5 Priya English 62 80
# 3 Devesh German 76 95
# 7 William German 87 75
# But now Subject is a character
class(dat2$Subject)
# [1] "character"