When we submitted our RJ application in the spring of 2022, we had already created a pilot corpus of all Swedish parliamentary debates since 1920. This preparatory work was carried out over two years within the Welfare State Analytics project. The pilot corpus contained four main types of annotation: (1) speaker introductions, (2) links from each speaker introduction to the correct member of parliament (MP), (3) speech text, and (4) non-speech text (e.g. descriptions of activities in the chamber).

We regularly released improved versions of the annotated corpus through our agile corpus curation workflow on GitHub. Our initial aim was to reach 90 % accuracy in speaker identification, which we judged to be good enough to start doing research on the material. To get there, we combined machine learning techniques with manual curation.

The graph below shows how the estimated accuracy of speaker identification in the pilot corpus changed with each release, improving in most of them, up to version 0.4.6.

[Figure: estimated speaker-identification accuracy per release, up to version 0.4.6]

We have now reached version 0.8, which covers both the whole bicameral period (1867–1970) and the unicameral period (1971–). What is new is that it includes the first version of the debates from 1867–1919. As the graph below shows, the estimated accuracy of speaker identification is mostly above 95 % for the period since 1920 but varies between 70 % and 90 % for the earlier period (blue line). This is, however, the first version of the complete speech corpus, and we intend to keep working on its quality and to release new, improved versions!

[Figure: estimated speaker-identification accuracy in version 0.8 (screenshot, 2023-04-12)]