This tutorial aims to provide a simple overview of what is included
within the finnsurveytext
package and teach you how to use
the main functions included in the package.
The below table shows you all the functions that are included in the package. The functions which are bolded are the main functions which are outlined in the sections below.
Section | Usage | Functions |
---|---|---|
1. Data Preparation | use the udpipe R package to clean and
annotate the raw data into a standardised format (CoNLL-U) suitable for
analysis. |
fst_format() fst_find_stopwords() fst_rm_stop_punct() fst_prepare() fst_prepare_svydesign() |
2. Data Exploration | create wordclouds, n-gram tables and summary tables for initial insights into trends across responses. | fst_summarise_short() fst_summarise() fst_pos() fst_length_summary() fst_use_svydesign() fst_freq_table fst_ngrams_table() fst_ngrams_table2() fst_freq_plot() fst_ngrams_plot() fst_freq() fst_ngrams() fst_wordcloud() |
3. Concept Network | creation of a concept network using the textrank
R package with node size indicating word importance
(PageRank) and edge weight showing co-occurrence of words. |
fst_cn_search() fst_cn_edges() fst_cn_nodes() fst_cn_plot() fst_concept_network() |
4. Comparison Functions | corresponding Data Exploration and Concept Network functions allowing for comparison between groups of survey respondents. | fst_pos_compare() fst_summarise_compare() fst_length_compare() fst_get_unique_ngrams_separate() fst_get_unique_ngrams() fst_join_unique() fst_ngrams_compare_plot() fst_freq_compare() fst_ngrams_compare() fst_comparison_cloud() fst_cn_get_unique_separate() fst_cn_get_unique() fst_cn_compare_plot() fst_concept_network_compare() |
First, the finnsurveytext
package needs to be installed
into your R environment and loaded into the environment. You may also
want to load in the survey
package if you want to use a
svydesign
object for the data and/or weights.
The data preparation functions are used to take your raw survey data (in a dataframe or svydesign object within your R environment) and convert it into a standardised format ready for analysis.
The functions in the remaining sections require your data to be pre-formatted into this format.
(To learn move about the format we use, see the Universal Dependencies Project.)
The package comes with sample data. For this demonstration, we use
dev_coop
. The raw data looks like this:
fsd_id | q11_1 | q11_2 | q11_3 | paino | gender | region | year_of_birth | education_level |
---|---|---|---|---|---|---|---|---|
1 | elämiseen tarvittavat perusasiat ovat kehittymättöneet (esim. vesi, talo, ruoka). Ja niitä ei ole riittävästi | varmaan koitetaan kehittää em | saastuminen ja luonnonvarojen liikakäyttö, nälänhätä ja ylikansoittuminen | 0.5440 | Female | Etelä-Suomi | 1992 | NA |
2 | on kurjuutta ja nälänhätää, asiat eivät ole vielä kehittyneet, lapset eivät pääse kouluun, tytöillä on huonompi asema kuin pojilla. | pyritään auttamaan? | ihmiskauppa, nälänhätä ja sodat/turvattomuus | 0.7171 | Female | Pohjois- ja Itä-Suomi | 1994 | Matriculation examination |
3 | jokaisella ei ole turvattua toimeentuloa ja jossa todella huomaa koulutuksen arvon. | autetaan ja näytetään ihmisille tie parempaan tulevaisuuteen heidän oman työnsä tuloksena. | kouluttamattomuus, nälkä ja puhtaan veden puute. | 0.6240 | Female | Helsinki-Uusimaa | 1994 | Matriculation examination |
4 | kehityksen taso ei ole yhtä korkea kuin kehittyneissä maissa | yleensä haitataan kehitysmaan kehittymistä | öljy, raha, se fakta että ei olla vielä päästy asumaan muualla kuin tällä yhdellä planeetalla | 0.3401 | NA | Länsi-Suomi | NA | NA |
5 | asiat ovat vielä huonossa jamassa ja apua tarvitaan | eriarvoisuus, sodat, nälänhätä tietyissä maissa | 0.6240 | Female | Helsinki-Uusimaa | 1993 | Upper secondary vocational qualification |
We will look at question 11_3 (responses to ‘’Jatka lausetta: Maailman kolme suurinta ongelmaa ovat… (Avokysymys)’) as our open-ended survey question. We also want to include our survey weights (in ‘paino’ column) and bring in the gender and region columns so we can use these values to compare groups.
The main function here is fst_prepare()
# FUNCTION DEFINITION
fst_prepare <- function(data,
question,
id,
model = "ftb",
stopword_list = "nltk",
weights = NULL,
add_cols = NULL,
manual = FALSE,
manual_list = "")
We can run the function as follows:
df <- fst_prepare(data = dev_coop,
question = 'q11_3',
id = 'fsd_id',
weights = 'paino',
add_cols = c('gender', 'region')
)
Summary of components
data
is the dataframe of interest. In this case, we are
using data that comes with the package called “dev_coop”.question
is the name of the column in your data
which contains the open-ended survey question. In this example, we’re
considering “q11_3”id
is the id column in our data, which is “fsd_id”udpipe
, in this case we are using the default Finnish
Treebank, model = "ftb"
. (There are two options for Finnish
language model; the other option is the Turku Dependency Treebank
“tdt”.)stopword_list
in this example. (To find the relevant lists
of Finnish stopwords, you can run the fst_find_stopwords()
function.) Punctuation is also removed from the data whenever stopwords
are removed.weights
column in your
formatted data. Our weights are stored in the raw data as “paino”.df
.manual
and manual_list
are used if you
want to provide your own list of stopwords to remove from the
data.)The formatted data looks like this:
doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc | weight | gender | region |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | saastuminen ja luonnonvarojen liikakäyttö, nälänhätä ja ylikansoittuminen | 1 | saastuminen | saastuminen | NOUN | N,Sg,Nom | Case=Nom|Number=Sing | 0 | root | NA | NA | 0.544 | Female | Etelä-Suomi |
1 | 1 | 1 | saastuminen ja luonnonvarojen liikakäyttö, nälänhätä ja ylikansoittuminen | 3 | luonnonvarojen | luonnonvaro | NOUN | N,Pl,Gen | Case=Gen|Number=Plur | 4 | nmod | NA | NA | 0.544 | Female | Etelä-Suomi |
The other option is to get your data from a svydesign
object from the survey
package. The survey
package is a popular package used for analysing surveys.
The main function here is fst_prepare_svydesign()
# FUNCTION DEFINITION
fst_prepare_svydesign <- function(svydesign,
question,
id,
model = "ftb",
stopword_list = "nltk",
use_weights = TRUE,
add_cols = NULL,
manual = FALSE,
manual_list = "")
We can run the function as follows:
df2 <- fst_prepare_svydesign(svydesign = svy_dev,
question = 'q11_3',
id = 'fsd_id',
use_weights = TRUE,
add_cols = c('gender', 'region')
)
The only differences between the previous function and this one are:
svydesign
is your svydesign object. In this
case, we have one called “svy_dev”use_weights = TRUE
The formatted data looks like this (should look very similar to the above formatted data!):
doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc | weight | gender | region |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | saastuminen ja luonnonvarojen liikakäyttö, nälänhätä ja ylikansoittuminen | 1 | saastuminen | saastuminen | NOUN | N,Sg,Nom | Case=Nom|Number=Sing | 0 | root | NA | NA | 0.544 | Female | Etelä-Suomi |
1 | 1 | 1 | saastuminen ja luonnonvarojen liikakäyttö, nälänhätä ja ylikansoittuminen | 3 | luonnonvarojen | luonnonvaro | NOUN | N,Pl,Gen | Case=Gen|Number=Plur | 4 | nmod | NA | NA | 0.544 | Female | Etelä-Suomi |
Now that we have formatted data, we can begin data exploration. These functions are used to create summary tables and to find the most common themes in your survey responses.
First, let’s create some summaries using fst_summarise
,
fst_pos
and fst_length_summary
These functions are defined as follows:
# FUNCTION DEFINITIONS
fst_summarise <- function(data,
desc = "All respondents")
fst_pos <- function(data)
fst_length_summary <- function(data,
desc = "All respondents",
incl_sentences = TRUE)
Summary of components
data
is the formatted data.desc
is an optional name for the responses summarised,
if not provided it will default to “All respondents”.incl_sentences
is an optional boolean for whether to
also summarise sentence length (in addition to word length), if not
provided it will default to TRUE.Hence, these functions are run for our sample data as follows:
## Description Respondents No Response Proportion Total Words Unique Words
## 1 All responses 945 25 0.97 4192 1132
## Unique Lemmas
## 1 994
## UPOS UPOS_Name Count Proportion
## 1 ADJ adjective 389 0.093
## 2 ADP adposition 24 0.006
## 3 ADV adverb 64 0.015
## 4 AUX auxiliary 3 0.001
## 5 CCONJ coordinating conjunction 3 0.001
## 6 DET determiner 28 0.007
## 7 INTJ interjection 2 0.000
## 8 NOUN noun 3311 0.790
## 9 NUM numeral 5 0.001
## 10 PART particle 29 0.007
## 11 PRON pronoun 12 0.003
## 12 PROPN proper noun 31 0.007
## 13 PUNCT punctuation NA NA
## 14 SCONJ subordinating conjunction NA NA
## 15 SYM symbol 1 0.000
## 16 VERB verb 278 0.066
## 17 X other 12 0.003
## # A tibble: 2 × 8
## Description Respondents Mean Minimum Q1 Median Q3 Maximum
## <chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 All responses- Words 920 5.52 1 4 5 6 32
## 2 All responses- Sentences 920 1.01 1 1 1 1 3
The first of our frequent words visualisations in the wordcloud which
comes from the wordcloud
package.
It is defined as follows:
# FUNCTION DEFINITION
fst_wordcloud <- function(data,
pos_filter = NULL,
max = 100,
use_svydesign_weights = FALSE,
id = "",
svydesign = NULL,
use_column_weights = FALSE)
Summary of components
data
is the formatted data.pos_filter
is an optional list of POS tags for
inclusion in the wordcloud. The default is NULL.max
is the maximum number of words to display, the
default is 100.Then, we have options for weighting the words in the cloud. These will all default to not include weights.
use_svydesign_weights
should be set as TRUE if we want
to use weights from within a svydesign object.id
is only required if weights are coming from a
svydesign objectsvydesign
objectHere are some examples of creating wordclouds:
Then, we have functions to identify and plot the most frequent words or n-grams (sets of n words in order).
# FUNCTION DEFINITIONS
fst_freq <- function(data,
number = 10,
norm = NULL,
pos_filter = NULL,
strict = TRUE,
name = NULL,
use_svydesign_weights = FALSE,
id = "",
svydesign = NULL,
use_column_weights = FALSE)
fst_ngrams <- function(data,
number = 10,
ngrams = 1,
norm = NULL,
pos_filter = NULL,
strict = TRUE,
name = NULL,
use_svydesign_weights = FALSE,
id = "",
svydesign = NULL,
use_column_weights = FALSE)
Summary of components
data
is the formatted data.number
is the number of top words/ngrams to
displayngrams
is the type of n-grams, default is
1
.norm
is an optional method for normalising the data.
Valid settings are “number_words” (the number of words in the
responses), “number_resp” (the number of responses), or NULL (raw count
returned, default, also used when weights are
applied).pos_filter
is an optional list of POS tags for
inclusion. The default is NULL.strict
is whether to strictly cut-off at
number
(ties are alphabetically ordered). The default value
is TRUE.name
is an optional “name” for the plot to add to
title, default is NULL.Then, we again have options for weighting the words in the plot. Again, these all default to not include weights.
use_svydesign_weights
should be set as TRUE if we want
to use weights from within a svydesign object.id
is only required if weights are coming from a
svydesign objectuse_svydesign_weights
should be set as TRUE if we want
to use weights from the weight column as set-up during the data
formatting.(fst_freq_table()
and fst_ngrams_table()
can be used to instead create tables of the top words.)
## words occurrence
## 1 köyhyys 258
## 2 nälänhätä 239
## 3 sota 231
## 4 ilmastonmuutos 141
## 5 puute 117
## 6 ihminen 105
## 7 vesi 98
## 8 epätasa-arvo 87
## 9 ahneus 84
## 10 nälkä 81
## 11 puhdas 75
## 12 sairaus 59
## 13 itsekkyys 58
## 14 väestönkasvu 48
## 15 välinpitämättömyys 47
Our concept network function uses the TextRank algorithm which is a graph-based ranking model for text processing. Vertices represent words and co-occurrence between words is shown through edges. Word importance is determined recursively (through the unsupervised learning TextRank algorithm) where words get more weight based on how many words co-occur and the weight of these co-occurring words.
To utilise the TextRank algorithm in finnsurveytext, we use the textrank package. For further information on the package, please see this documentation. This package implements the TextRank and PageRank algorithms. (PageRank is the algorithm that Google uses to rank webpages.) You can read about the underlying TextRank algorithm here and about the PageRank algorithm here.
The main concept network function is
fst_concept_network()
. It is defined as follows:
# FUNCTION DEFINITIONS
fst_concept_network <- function(data,
concepts,
threshold = NULL,
norm = NULL,
pos_filter = NULL,
title = NULL)
Summary of components
data
is the formatted data.concepts
are the concept words around which the network
is created.threshold
is a minimum number of occurrences threshold
for ‘edge’ between searched term and other word, default is
NULL
. Note, the threshold is applied before
normalisation.norm
is an optional method for normalising the data.
Valid settings are “number_words” (the number of words in the
responses), “number_resp” (the number of responses), or NULL (raw count
returned, default, also used when weights are
applied).pos_filter
is an optional list of POS tags for
inclusion. The default is NULL.title
is an optional title for plot, default is
NULL
and a generic title (“TextRank extracted keyword
occurrences”) will be used.For example, we can create the following concept network plots:
Recall that when we preprocessed the data, we included two additional columns, gender and region, to allow for comparison between respondents based on these values.
There are counterpart comparison functions for each of the functions we have shown above.
The comparison summary tables are defined as follows:
fst_pos_compare <- function(data,
field,
exclude_nulls = FALSE,
rename_nulls = 'null_data')
fst_summarise_compare <- function(data,
field,
exclude_nulls = FALSE,
rename_nulls = 'null_data')
fst_length_compare <- function(data,
field,
incl_sentences = TRUE,
exclude_nulls = FALSE,
rename_nulls = 'null_data')
Summary of Components
data
is the formatted data.field
is the column in data
used for
splitting groupsexclude_nulls
. The
default value is FALSE.rename_nulls
is what to fill empty values with if
exclude_nulls = FALSE
.Let’s compare our responses based on the region of the respondent:
UPOS | Part_of_Speech_Name | Etelä-Suomi-Count | Etelä-Suomi-Prop | Helsinki-Uusimaa-Count | Helsinki-Uusimaa-Prop | Länsi-Suomi-Count | Länsi-Suomi-Prop | Pohjois- ja Itä-Suomi-Count | Pohjois- ja Itä-Suomi-Prop | null_data-Count | null_data-Prop |
---|---|---|---|---|---|---|---|---|---|---|---|
ADJ | adjective | 79 | 0.101 | 118 | 0.098 | 105 | 0.082 | 84 | 0.093 | 3 | 0.188 |
ADP | adposition | 4 | 0.005 | 11 | 0.009 | 6 | 0.005 | 3 | 0.003 | NA | NA |
ADV | adverb | 8 | 0.010 | 20 | 0.017 | 22 | 0.017 | 13 | 0.014 | 1 | 0.062 |
AUX | auxiliary | 1 | 0.001 | 1 | 0.001 | 1 | 0.001 | NA | NA | NA | NA |
CCONJ | coordinating conjunction | 1 | 0.001 | NA | NA | 1 | 0.001 | 1 | 0.001 | NA | NA |
DET | determiner | 6 | 0.008 | 10 | 0.008 | 7 | 0.005 | 5 | 0.006 | NA | NA |
INTJ | interjection | NA | NA | 2 | 0.002 | NA | NA | NA | NA | NA | NA |
NOUN | noun | 610 | 0.776 | 936 | 0.774 | 1028 | 0.805 | 725 | 0.802 | 12 | 0.750 |
NUM | numeral | 1 | 0.001 | 1 | 0.001 | 3 | 0.002 | NA | NA | NA | NA |
PART | particle | 2 | 0.003 | 15 | 0.012 | 9 | 0.007 | 3 | 0.003 | NA | NA |
PRON | pronoun | 1 | 0.001 | 4 | 0.003 | 3 | 0.002 | 4 | 0.004 | NA | NA |
PROPN | proper noun | 3 | 0.004 | 12 | 0.010 | 14 | 0.011 | 2 | 0.002 | NA | NA |
PUNCT | punctuation | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
SCONJ | subordinating conjunction | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
SYM | symbol | NA | NA | 1 | 0.001 | NA | NA | NA | NA | NA | NA |
VERB | verb | 69 | 0.088 | 72 | 0.060 | 75 | 0.059 | 62 | 0.069 | NA | NA |
X | other | 1 | 0.001 | 6 | 0.005 | 3 | 0.002 | 2 | 0.002 | NA | NA |
Description | Respondents | No Response | Proportion | Total Words | Unique Words | Unique Lemmas |
---|---|---|---|---|---|---|
Etelä-Suomi | 176 | 0 | 1.00 | 786 | 360 | 336 |
Helsinki-Uusimaa | 269 | 10 | 0.96 | 1209 | 493 | 443 |
Länsi-Suomi | 284 | 8 | 0.97 | 1277 | 478 | 425 |
Pohjois- ja Itä-Suomi | 213 | 7 | 0.97 | 904 | 338 | 309 |
null_data | 3 | 0 | 1.00 | 16 | 16 | 16 |
Description | Respondents | Mean | Minimum | Q1 | Median | Q3 | Maximum |
---|---|---|---|---|---|---|---|
Etelä-Suomi- Words | 176 | 5.483 | 2 | 4.0 | 5 | 6 | 23 |
Etelä-Suomi- Sentences | 176 | 1.017 | 1 | 1.0 | 1 | 1 | 3 |
Helsinki-Uusimaa- Words | 259 | 5.579 | 1 | 4.0 | 5 | 6 | 25 |
Helsinki-Uusimaa- Sentences | 259 | 1.023 | 1 | 1.0 | 1 | 1 | 3 |
Länsi-Suomi- Words | 276 | 5.620 | 1 | 4.0 | 5 | 6 | 32 |
Länsi-Suomi- Sentences | 276 | 1.007 | 1 | 1.0 | 1 | 1 | 2 |
Pohjois- ja Itä-Suomi- Words | 206 | 5.311 | 1 | 4.0 | 5 | 6 | 20 |
Pohjois- ja Itä-Suomi- Sentences | 206 | 1.015 | 1 | 1.0 | 1 | 1 | 2 |
null_data- Words | 3 | 6.333 | 5 | 5.5 | 6 | 7 | 8 |
null_data- Sentences | 3 | 1.000 | 1 | 1.0 | 1 | 1 | 1 |
The ngrams comparison functions are defined similarly (with some additional new values):
# FUNCTION DEFINITIONS
fst_freq_compare <- function(data,
field,
number = 10,
norm = NULL,
pos_filter = NULL,
strict = TRUE,
use_svydesign_weights = FALSE,
id = "",
svydesign = NULL,
use_column_weights = FALSE,
exclude_nulls = FALSE,
rename_nulls = 'null_data',
unique_colour = "indianred",
title_size = 20,
subtitle_size = 15)
fst_ngrams_compare <- function(data,
field,
number = 10,
ngrams = 1,
norm = NULL,
pos_filter = NULL,
strict = TRUE,
use_svydesign_weights = FALSE,
id = "",
svydesign = NULL,
use_column_weights = FALSE,
exclude_nulls = FALSE,
rename_nulls = 'null_data',
unique_colour = "indianred",
title_size = 20,
subtitle_size = 15)
The new components are:
unique_colour
is chosen to differentiate values which
are unique to one group of respondents, the default is “indianred”title_size
and subtitle_size
set these,
you may need to change them from the default values if any of your group
names are long or if there are many groups.For the ngrams, let’s compare respondents by gender.
The comparison cloud extends the wordcloud concept.
A comparison cloud compares the relative frequency with which a term is used in two or more documents. This cloud shows words that occur more regularly in responses from a specific type of respondent. For more information about comparison clouds, you can read this documentation.
The comparison cloud is defined as follows, with settings as defined for the previous functions:
# FUNCTION DEFINITION
fst_comparison_cloud <- function(data,
field,
pos_filter = NULL,
norm = NULL,
max = 100,
use_svydesign_weights = FALSE,
id = "",
svydesign = NULL,
use_column_weights = FALSE,
exclude_nulls = FALSE,
rename_nulls = "null_data")
Thus, we can create comparison clouds:
Finally we have the comparison concept network which has the following components which should be familiar from previous functions:
# FUNCTION DEFINITION
fst_concept_network_compare <- function(data,
concepts,
field,
norm = NULL,
threshold = NULL,
pos_filter = NULL,
exclude_nulls = FALSE,
rename_nulls = 'null_data',
title_size = 20,
subtitle_size = 15)
We run the comparison concept network as follows:
fst_concept_network_compare(df,
concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute",
'gender',
exclude_nulls = TRUE
)
For more information on the finnsurveytext
functions,
see the package
website and documentation available from the CRAN.
The package comes with sample data from two surveys obtained from the Finnish Social Science Data Archive: