Abstract
The R packagetm.plugin.koRpus
is an extension to the koRpus
package, enhancing its usability for actual corpus analysis. It adds object classes and methods, inheriting from those provided by koRpus
, which are designed to work with complete text corpora in both koRpus
and tm
formats. This vignette gives you a quick overview.
While the koRpus
package focusses mostly on analysis steps of individual texts, tm.plugin.koRpus
adds a new object class and respective methods, which can be used to analyse complete text corpora in a single step. The object class can also be a first step to building a bridge between the koRpus
and tm
packages.
At the core of this package there is one particular object class – kRp.corpus
– which can be used to construct simple corpus objects or even hierarchically nested corpora. That is, you are able to categorize corpora on as many levels as you need. The examples in this vignette use two levels, one being different topics the texts in the sample corpus deal with, and the other different sources the texts come from.
If you don’t need these hierarchical levels, you can just use the method readCorpus()
to create a corpus object, i.e., a simple collection of texts. To distinguish texts which came from different sources or deal with different topics, use the hierarchy
argument, which will add categorial columns to the tagged text objects. These objects will only be valid if there are texts of each topic from each source.
Now, if this still confuses you, let’s look at a small example.
As with koRpus
, the first step for text analysis is tokenizing and possibly POS tagging. This step is performed by the readCorpus()
method mentioned above. The package includes four sample texts taken from Wikipedia1 in its tests
directory which we can use for a demonstration:
library(tm.plugin.koRpus)
library(koRpus.lang.de)
# set the root path to the sample files
sampleRoot <- file.path(path.package("tm.plugin.koRpus"), "tests", "testthat", "samples")
# the next call uses "hierarchy" to describe the directory structure
# and its meaning; see below
sampleTexts <- readCorpus(
dir=sampleRoot,
hierarchy=list(
Topic=c(
C3S="C3S SCE",
GEMA="GEMA e.V."
),
Source=c(
Wikipedia_alt="Wikipedia (alt)",
Wikipedia_neu="Wikipedia (neu)"
)
),
tagger="tokenize",
lang="de"
)
Processing corpus...
Topic "C3S SCE", 2 texts...
Source "Wikipedia (alt)", 1 text...
Source "Wikipedia (neu)", 1 text...
Topic "GEMA e.V.", 2 texts...
Source "Wikipedia (alt)", 1 text...
Source "Wikipedia (neu)", 1 text...
hierarchy
argumentThe hierarchy
argument describes our corpus in a very condensed format. It is a named list of named character vectors, where each list entry represents a hierarchical level. In this case, the top level is called “Topics”, below that is the level “Source”. These hierachical levels must also be represented by the directory structure of the texts to parse, and the names of the character vectors must be identical to the directory names below the root directory specified by dir
.2
So on your file system, what the hierarchy
argument above describes is the following layout:
.../samples/
C3S/
Wikipedia_alt/
Text01.txt
Text02.txt
...
Wikipedia_neu/
Text03.txt
Text04.txt
...
GEMA/
Wikipedia_alt/
Text05.txt
Text06.txt
...
Wikipedia_neu/
Text07.txt
Text08.txt
...
Since we’re using the koRpus
package for all actual analysis, you can also setup your environment with set.kRp.env()
and POS-tag all texts with TreeTagger
3.
After the initial tokenizing, we can analyse the corpus by calling the provided methods, for instance lexical diversity:
doc_id Topic
C3S-Wikipedia_alt-C3S_2013-09-24.txt C3S-Wikipedia_alt-C3S_2013-09-24.txt C3S SCE
C3S-Wikipedia_neu-C3S_2015-07-05.txt C3S-Wikipedia_neu-C3S_2015-07-05.txt C3S SCE
GEMA-Wikipedia_alt-GEMA_2013-09-26.txt GEMA-Wikipedia_alt-GEMA_2013-09-26.txt GEMA e.V.
GEMA-Wikipedia_neu-GEMA_2015-07-05.txt GEMA-Wikipedia_neu-GEMA_2015-07-05.txt GEMA e.V.
Source a C CTTR HDD K lgV0 MATTR MSTTR
C3S-Wikipedia_alt-C3S_2013-09-24.txt Wikipedia (alt) 0.16 0.95 6.13 38.14 49.92 6.21 0.81 0.79
C3S-Wikipedia_neu-C3S_2015-07-05.txt Wikipedia (neu) 0.17 0.94 6.82 38.05 54.88 6.10 0.82 0.76
GEMA-Wikipedia_alt-GEMA_2013-09-26.txt Wikipedia (alt) 0.17 0.94 7.07 37.61 65.08 6.11 0.80 0.78
GEMA-Wikipedia_neu-GEMA_2015-07-05.txt Wikipedia (neu) 0.16 0.94 7.13 37.87 60.14 6.24 0.81 0.79
MTLD MTLDMA R S TTR U
C3S-Wikipedia_alt-C3S_2013-09-24.txt 100.16 NA 8.68 0.93 0.78 39.92
C3S-Wikipedia_neu-C3S_2015-07-05.txt 123.01 NA 9.65 0.92 0.73 36.46
GEMA-Wikipedia_alt-GEMA_2013-09-26.txt 106.94 192 10.00 0.92 0.71 35.96
GEMA-Wikipedia_neu-GEMA_2015-07-05.txt 111.64 NA 10.08 0.92 0.73 37.47
As you can see, corpusSummary()
returns a data.frame
object with the summarised results of all texts. Here’s an example how to use this to plot interactions:
library(sciplot)
lineplot.CI(
x.factor=corpusSummary(sampleTexts)[["Source"]],
response=corpusSummary(sampleTexts)[["MTLD"]],
group=corpusSummary(sampleTexts)[["Topic"]],
type="l",
main="MTLD",
xlab="Media source",
ylab="Lexical diversity score",
col=c("grey", "black"),
lwd=2
)
There are quite a number of corpus*()
getter/setter methods for slots of these objects, e.g., corpusReadability()
to get the readability()
results from objects of class kRp.corpus
.
The S4 object class provided by tm.plugin.koRpus
directly inherits its structure from kRp.text
of the koRpus
package, adding additional slots for meta information and Corpus
objects of the tm
package for raw data.
Two methods can be especially helpful for further analysis. The first one is tif_as_tokens_df()
and returns a data.frame
including all texts of the tokenized corpus in a format that is compatible with Text Interchange Formats standards.
The second one is a family of [
, [<-
, [[
and [[<-
shorcuts to directly interact with the data.frame
object you would get via taggedText()
.
The object class makes it quite comfortable to analyse type frequencies of corpora. There is a method read.corp.custom()
for these classes, that will do this analysis recursively on all levels:
sampleTexts <- read.corp.custom(sampleTexts, case.sens=FALSE)
sampleTextsWordFreq <- query(
corpusCorpFreq(sampleTexts),
var="wclass",
query=kRp.POS.tags(lang="de", list.classes=TRUE, tags="words")
)
head(sampleTextsWordFreq, 10)
num word lemma tag wclass lttr freq pct pmio log10 rank.avg rank.min
3 3 die word.kRp word 3 30 0.037220844 37220 4.570776 263.0 263
4 4 der word.kRp word 3 21 0.026054591 26054 4.415874 262.0 262
5 5 gema word.kRp word 4 17 0.021091811 21091 4.324097 260.5 260
6 6 und word.kRp word 3 17 0.021091811 21091 4.324097 260.5 260
7 7 einer word.kRp word 5 12 0.014888337 14888 4.172836 258.5 258
8 8 von word.kRp word 3 12 0.014888337 14888 4.172836 258.5 258
11 11 ist word.kRp word 3 10 0.012406948 12406 4.093632 256.0 255
12 12 bei word.kRp word 3 9 0.011166253 11166 4.047898 254.0 254
13 13 das word.kRp word 3 8 0.009925558 9925 3.996731 252.5 252
14 14 urheber word.kRp word 7 8 0.009925558 9925 3.996731 252.5 252
rank.rel.avg rank.rel.min inDocs idf
3 99.24528 99.24528 4 0.00000
4 98.86792 98.86792 4 0.00000
5 98.30189 98.11321 4 0.00000
6 98.30189 98.11321 4 0.00000
7 97.54717 97.35849 4 0.00000
8 97.54717 97.35849 4 0.00000
11 96.60377 96.22642 4 0.00000
12 95.84906 95.84906 4 0.00000
13 95.28302 95.09434 4 0.00000
14 95.28302 95.09434 2 0.30103
In combination with the wordcloud
package, this can directly be used to plot word clouds:
See the file tests/testthat/samples/License\_of\_sample\_texts.txt
for details↩
Future versions of this package might add furter ways of describing your corpus, like using a configuration file or providing a full corpus in XML or JSON format. But don’t hold your breath.↩
see the koRpus
documentation for details.↩