Extract Speeches from Records — extract_speeches_from

These functions extract speeches from the Riksdagen records, where a speech is defined as the utterances (<u>) that follow a speaker introduction (<note type="speaker">). They return the segments of each speech.

extract_speeches_from_records() handles multiple files and can process them in parallel.
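
As a rough illustration of this definition (not the package's own implementation), the sketch below groups utterances under the most recent speaker introduction using the xml2 package. The toy XML fragment, its plain id attributes, the absence of namespaces, and the <seg> children are simplifying assumptions made for this example.

```r
library(xml2)

# Toy fragment; element layout and attribute names are assumptions.
xml_txt <- '<div>
  <note type="speaker" id="i-1">Anf. 1 Speaker A:</note>
  <u who="p-1"><seg id="s-1">First segment.</seg><seg id="s-2">Second segment.</seg></u>
  <note type="speaker" id="i-2">Anf. 2 Speaker B:</note>
  <u who="p-2"><seg id="s-3">A short reply.</seg></u>
</div>'
doc <- read_xml(xml_txt)

# Speaker introductions and utterances, in document order.
nodes <- xml_find_all(doc, "//*[self::u or self::note[@type='speaker']]")
is_intro <- xml_name(nodes) == "note"

# Every speaker introduction starts a new speech.
speech_no <- cumsum(is_intro)
intro_ids <- xml_attr(nodes[is_intro], "id")

# One row per segment of each utterance that follows an introduction.
rows <- lapply(which(!is_intro & speech_no > 0), function(i) {
  segs <- xml_find_all(nodes[[i]], "./seg")
  data.frame(
    speech_no = speech_no[i],
    speech_id = intro_ids[speech_no[i]],
    who = xml_attr(nodes[[i]], "who"),
    id = xml_attr(segs, "id"),
    text = xml_text(segs),
    stringsAsFactors = FALSE
  )
})
speeches <- do.call(rbind, rows)
```

Utterances that appear before any speaker introduction are skipped here; whether the package treats them the same way is not stated in this page.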

extract_speeches_from_record(record_path)

extract_speeches_from_records(
  record_paths,
  mc.cores = getOption("mc.cores", detectCores() - 1L),
  ...
)

assert_and_complement_paths(record_paths)

Arguments

record_path: a file path to a record XML file
record_paths: a vector of file paths to record XML files
mc.cores: the number of cores to use in mclapply (Linux and Mac only). Defaults to the mc.cores option if set, otherwise the number of detected cores minus one.
...: further arguments supplied to mclapply.
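
The mc.cores argument follows the usual mclapply pattern from the parallel package. The sketch below shows that pattern with a hypothetical stand-in function, process_one(), and made-up file names; how the package combines the per-record results is an assumption here.

```r
library(parallel)

# Hypothetical stand-in for a per-record extractor.
process_one <- function(path) {
  data.frame(
    record_id = sub("\\.xml$", "", basename(path)),
    stringsAsFactors = FALSE
  )
}

# Made-up paths, used only as labels in this sketch.
record_paths <- c("1990/prot-a.xml", "1991/prot-b.xml", "1992/prot-c.xml")

# detectCores() may return NA; forking is unavailable on Windows,
# where mclapply only accepts mc.cores = 1.
n_detected <- detectCores()
if (is.na(n_detected)) n_detected <- 1L
cores <- if (.Platform$OS.type == "windows") 1L else max(1L, n_detected - 1L)

# One result per record, returned in input order, then row-bound.
results <- mclapply(record_paths, process_one, mc.cores = cores)
combined <- do.call(rbind, results)
```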

Value

The function returns a tibble data frame with the following variables:

record_id: The id of the record.
speech_no: The speech number in the record.
speech_id: The id of the XML node of the speaker introduction that starts the speech.
who: The id of the person giving the speech.
id: The id of the XML node for the segment of the speech.
text: The speech segment as plain text.
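
Because the result holds one row per segment, a common follow-up step is to collapse the segments into one row per speech. The sketch below does this in base R on a hand-made data frame that mimics the documented columns; the values are invented for illustration.

```r
# Hand-made stand-in for the returned tibble (values are invented).
segments <- data.frame(
  record_id = "rec-1",
  speech_no = c(1L, 1L, 2L),
  speech_id = c("i-1", "i-1", "i-2"),
  who = c("p-1", "p-1", "p-2"),
  id = c("s-1", "s-2", "s-3"),
  text = c("First segment.", "Second segment.", "A short reply."),
  stringsAsFactors = FALSE
)

# Paste the segments of each speech together, keeping their order.
speeches <- aggregate(
  text ~ record_id + speech_no + speech_id + who,
  data = segments,
  FUN = paste, collapse = " "
)
```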

Details

The function checks whether a file exists at record_path. If it does not, the function tries to complement the path with the corpora path.
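
The fallback described above can be sketched as follows. The helper complement_path() and its corpus_dir argument are hypothetical; how the package actually locates the corpora path is not described in this page.

```r
# Hypothetical helper: return the path as given if it is a file,
# otherwise try the same path relative to a corpus directory.
complement_path <- function(record_path, corpus_dir) {
  is_file <- function(p) file.exists(p) && !dir.exists(p)
  if (is_file(record_path)) {
    return(record_path)
  }
  candidate <- file.path(corpus_dir, record_path)
  if (!is_file(candidate)) {
    stop("No record file found at: ", record_path)
  }
  candidate
}

# Demo with a throwaway corpus directory.
corpus_dir <- file.path(tempdir(), "corpus-demo")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("<record/>", file.path(corpus_dir, "prot-demo.xml"))

resolved <- complement_path("prot-demo.xml", corpus_dir)
```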