Using dataframes

Mariana Montes

2024-05-19

The glossr package encourages you to keep your examples in one dataframe that you can extract glosses from. You can filter it based on the label names or any other variables and print a series of glosses next to each other with one call.

If you like this feature and you have, for example, a dataframe called glosses, you might find yourself calling variations of gloss_df(filter(glosses, label == "my-label")) multiple times in a text. This vignette shows how to work with gloss_factory() so that you only need to type my_gloss("my-label") instead. In addition, gloss_factory() performs some validation on your dataframe to avoid undesired output.

library(glossr)
#> Setting up the leipzig engine.
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> 
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> 
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)

Create a gloss factory

The first thing you need to do is assign the return value of gloss_factory() to a short variable that works for you. I recommend trying this out in the console, and then calling it in a setup R chunk that doesn’t print messages or warnings.

by_label <- gloss_factory(glosses)
#> ℹ The following columns will be used for the gloss texts, in the following order:
#> ✔ `source` (not aligned!)
#> ✔ `original`, `parsed`, and `language` (aligned columns)
#> ✔ `translation` (not aligned!)
#> ✔ The `label` column will be used for labels.

By default (unless verbose = FALSE), gloss_factory() prints a few messages after checking the dataframe that was provided. It reports whether there are source, translation and label columns (“not aligned”, because they are printed as running text) and which of the remaining columns will provide the text lines (“aligned”, because they are aligned to each other word by word). Notice that here it includes the language column in the group of aligned lines, which we don’t want, so we would prefer to remove it.
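
For a setup chunk, the column report can be switched off with the verbose argument mentioned above. A minimal sketch, assuming the glosses dataset that ships with glossr; the name quiet_by_label is only illustrative, and warnings about missing or extra columns may still need to be silenced through the chunk options.

```r
library(glossr)

# verbose = FALSE asks gloss_factory() not to print the column report.
# `quiet_by_label` is an illustrative name, not something glossr defines.
quiet_by_label <- gloss_factory(glosses, verbose = FALSE)
```

The result is the same kind of function as by_label below; only the messages differ.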

If any of the expected columns (source, translation or label) are not present, it will print a warning. These are just warnings: maybe it’s exactly what you are expecting, and that’s ok.

by_label <- glosses |> 
  select(-language, -translation) |> 
  gloss_factory()
#> ℹ The following columns will be used for the gloss texts, in the following order:
#> ✔ `source` (not aligned!)
#> ✔ `original` and `parsed` (aligned columns)
#> ✖ `translation` (not aligned!)
#> ✔ The `label` column will be used for labels.

If there are too many text columns, it will also warn you:

by_label <- glosses |>
  rename(trans = translation) |>
  gloss_factory()
#> ! There are 4 columns that can be printed as text: `original`, `parsed`,
#> `trans`, and `language`. Only the first three will be used.
#> ℹ The following columns will be used for the gloss texts, in the following order:
#> ✔ `source` (not aligned!)
#> ✔ `original`, `parsed`, and `trans` (aligned columns)
#> ✖ `translation` (not aligned!)
#> ✔ The `label` column will be used for labels.

We can either remove the extra column from the dataframe before giving it to gloss_factory() or add its name to the ignore_columns argument. The latter lets us keep using the column for filtering without gloss_df() finding out about its existence. Other kinds of modifications, however, have to be performed beforehand.

modified_glosses <- glosses |>
  mutate(source = paste0("(", source, ")"))
by_label <- modified_glosses |>
  gloss_factory(ignore_columns = "language")
#> ℹ The following columns will be used for the gloss texts, in the following order:
#> ✔ `source` (not aligned!)
#> ✔ `original` and `parsed` (aligned columns)
#> ✔ `translation` (not aligned!)
#> ✔ The `label` column will be used for labels.

gloss_factory() is a function factory: its output is a function.

class(by_label)
#> [1] "function"
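
If function factories are new to you, the idea can be sketched in base R. This is not how glossr is implemented: make_getter() and the toy dataframe are hypothetical, and the sketch only returns rows instead of printing glosses.

```r
# Toy data standing in for a gloss dataframe (values made up for illustration).
toy <- data.frame(
  label = c("ex-a", "ex-b", "ex-c"),
  original = c("uno", "dos", "tres"),
  stringsAsFactors = FALSE
)

# make_getter() is a hypothetical factory: it captures `df` once and
# returns a function that only needs the labels.
make_getter <- function(df) {
  force(df)  # evaluate the argument now, so later changes to `toy` don't leak in
  function(...) {
    labels <- c(...)
    df[df$label %in% labels, , drop = FALSE]
  }
}

get_toy <- make_getter(toy)
class(get_toy)
#> [1] "function"
get_toy("ex-b")$original
#> [1] "dos"
```

The dataframe is supplied once, when the factory is called; every later call only needs the labels.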

This means that you call gloss_factory() once at the beginning, and then call the function it created as many times as you need. Here the function is called by_label(), but you can choose the name that suits you best. As you can see below, by_label("heartwarming-jp") is equivalent to gloss_df(filter(modified_glosses, label == "heartwarming-jp")).

by_label("heartwarming-jp")
  1. (Shindo 2015:660)

    Kotae-nagara otousan to okaasan wa honobonoto atatakai2 mono ni tsutsum-areru kimochi ga shi-ta.

    reply-while father and mother TOP heartwarming warm thing with surround-PASS feeling NOM do-PST

    “While replying (to your question), Father and Mother felt like they were surrounded by something heart warming.”

Filter by label or id

By default, the function created by gloss_factory() will take a label or set of labels and use it for filtering. In principle, the call below is equivalent to gloss_df(filter(modified_glosses, label %in% c("heartwarming-jp", "languid-jp", "feel-dutch"))). However, unlike filter(), it keeps the requested order of your items!

by_label("heartwarming-jp", "languid-jp", "feel-dutch")
  1. (Shindo 2015:660)

    Kotae-nagara otousan to okaasan wa honobonoto atatakai2 mono ni tsutsum-areru kimochi ga shi-ta.

    reply-while father and mother TOP heartwarming warm thing with surround-PASS feeling NOM do-PST

    “While replying (to your question), Father and Mother felt like they were surrounded by something heart warming.”

  2. (Shindo 2015:660)

    Ainiku sonna shumi wa nai. Tsumetai-none. Kedaru-souna koe da-tta.

    unfortunately such interest TOP not.exist cold-EMPH languid-seem voice COP-PST

    “Unfortunately I never have such an interest. You are so cold. (Her) voice sounded languid.”

  3. (Ross 1996:204)

    Ik heb het koud

    1SG have 3SG COLD.A

    “I am cold; literally: I have it cold.”
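
The difference in ordering can be illustrated in base R. This is a sketch with made-up labels, not the glossr implementation: a membership test such as %in% returns rows in dataframe order, whereas a lookup with match() follows the order of the request.

```r
toy <- data.frame(
  label = c("ex-a", "ex-b", "ex-c"),
  stringsAsFactors = FALSE
)
wanted <- c("ex-c", "ex-a")

# %in% tests each row in turn, so rows come back in dataframe order.
in_data_order <- toy$label[toy$label %in% wanted]
in_data_order
#> [1] "ex-a" "ex-c"

# match() looks up each requested label in turn, so rows come back in request order.
in_request_order <- toy$label[match(wanted, toy$label)]
in_request_order
#> [1] "ex-c" "ex-a"
```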

You can also use a different column for your ids with the id_column argument. gloss_factory() will warn you if the values in that column are not unique (in case you were expecting them to be).

by_language <- modified_glosses |> 
  gloss_factory(id_column = "language", ignore_columns = "language")
#> ℹ The following columns will be used for the gloss texts, in the following order:
#> ✔ `source` (not aligned!)
#> ✔ `original` and `parsed` (aligned columns)
#> ✔ `translation` (not aligned!)
#> ✔ The `label` column will be used for labels.
#> ! The values in `language` are not unique. Only the first match of repeated ids will be returned.
by_language("Icelandic")
  1. (Einarsson 1945:170)

    Mér er heitt/kalt

    1SG.DAT COP.1SG.PRS hot/cold.A

    “I am hot/cold.”

You will also get a warning if one of your requested ids is not in your dataset.

by_language("Japanese", "Mandarin")
#> ! The following ids are not present in the dataset:
#> • Mandarin
  1. (Shindo 2015:660)

    Kotae-nagara otousan to okaasan wa honobonoto atatakai2 mono ni tsutsum-areru kimochi ga shi-ta.

    reply-while father and mother TOP heartwarming warm thing with surround-PASS feeling NOM do-PST

    “While replying (to your question), Father and Mother felt like they were surrounded by something heart warming.”
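
Checks like the two in this section can be reproduced in base R if you want to inspect an id column yourself before building a factory. The vectors below are made up, and this is only a sketch of the idea, not the code glossr runs.

```r
ids <- c("Japanese", "Japanese", "Icelandic")  # made-up id column with a repeat
requested <- c("Japanese", "Mandarin")

# anyDuplicated() returns the index of the first repeated value, or 0 if none.
has_duplicates <- anyDuplicated(ids) > 0
has_duplicates
#> [1] TRUE

# Requested ids that never occur in the column.
missing_ids <- setdiff(requested, ids)
missing_ids
#> [1] "Mandarin"

# match() returns only the first position of each id that is present.
first_rows <- match(intersect(requested, ids), ids)
first_rows
#> [1] 1
```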

Filter with other conditional statements

Filtering by label name is probably the most common case, but you might want a bit more freedom. For that, you can create a different function with the use_conditionals argument. The new function takes whatever conditions you give it and sends them to dplyr::filter().

by_cond <- modified_glosses |>
  gloss_factory(use_conditionals = TRUE, ignore_columns = "language")
#> ℹ The following columns will be used for the gloss texts, in the following order:
#> ✔ `source` (not aligned!)
#> ✔ `original` and `parsed` (aligned columns)
#> ✔ `translation` (not aligned!)
#> ✔ The `label` column will be used for labels.
by_cond(str_ends(label, "jp"))
  1. (Shindo 2015:660)

    Kotae-nagara otousan to okaasan wa honobonoto atatakai2 mono ni tsutsum-areru kimochi ga shi-ta.

    reply-while father and mother TOP heartwarming warm thing with surround-PASS feeling NOM do-PST

    “While replying (to your question), Father and Mother felt like they were surrounded by something heart warming.”

  2. (Shindo 2015:660)

    Ainiku sonna shumi wa nai. Tsumetai-none. Kedaru-souna koe da-tta.

    unfortunately such interest TOP not.exist cold-EMPH languid-seem voice COP-PST

    “Unfortunately I never have such an interest. You are so cold. (Her) voice sounded languid.”
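
Passing conditions through to dplyr::filter() can be sketched with a small factory of our own. make_cond_getter() and the toy dataframe are hypothetical and not part of glossr; the real by_cond() also prints the glosses, which this sketch does not.

```r
# Requires the dplyr package; it is called via its namespace, so nothing is attached.
toy <- data.frame(
  label = c("warm-jp", "cold-jp", "feel-dutch"),
  stringsAsFactors = FALSE
)

# Hypothetical factory: the returned function forwards its arguments,
# unevaluated, to dplyr::filter(), so column names can be used directly.
make_cond_getter <- function(df) {
  force(df)
  function(...) dplyr::filter(df, ...)
}

toy_by_cond <- make_cond_getter(toy)
toy_by_cond(endsWith(label, "jp"))$label
#> [1] "warm-jp" "cold-jp"
```

Because the conditions are forwarded through `...`, anything that dplyr::filter() accepts can be used, including several conditions separated by commas.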

Many factories?

One of the advantages of a function factory is that you can create a function tailored to the dataset you’re working with. You don’t need to name your dataset in every call, and you save on typing.

In addition, you can have multiple factories in one project. Within a file, you may create both a by_label() and a by_cond() function to switch between label and conditional filtering, whichever suits you best at any time. Or you could have a dutch_gloss() and a chinese_gloss(), for example, each using a different dataset!