#acl +All:read Default

= LocalGovPL (Korpus Debat Samorządowych) =

LocalGovPL is a large-scale, speaker-annotated corpus of Polish local government meeting transcripts processed using an automatic two-stage LLM pipeline. The corpus consists of 31,899 sessions from 749 councils recorded between 2018 and 2025 (approximately 363M words). It is released in TEI P5 format with explicit links between utterances and registered participants.

The corpus covers various levels of local administration – municipalities (PL ''gminy''), counties (PL ''powiaty''), cities (PL ''miasta''), and regional assemblies (PL ''sejmiki województw'') – including both plenary sessions and committee meetings.

The primary goal of the resource is to facilitate research on the language of local governance, including studies of argumentation, interactional patterns, policy framing, and social dynamics within institutional dialogue. Beyond linguistic research, the corpus supports applications in speech-to-text alignment, automatic summarization, speaker role identification, and computational social science.

== Data Sources ==

The raw transcripts were collected from two main publicly available sources:

 1. Websites maintained by local administrative bodies – a set of specialized HTML extraction parsers was implemented to retrieve and normalise transcripts.
 1. [[https://esesja.tv/|eSesja.tv]] – the meeting streaming platform used by local governments, from which transcription files in WebVTT format were downloaded.

The dataset covers meetings from November 2018 to June 2025 and includes several thousand hours of deliberation. Due to the decentralised publication practices of local institutions, the source transcripts exhibit substantial variability in format, structure, and language conventions.

The preprocessing stage included normalisation of document encoding, removal of irrelevant metadata (e.g., agenda headers or timestamps), and segmentation into individual utterance candidates.
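The preprocessing of WebVTT sources described above (dropping timestamps and other metadata, then segmenting into utterance candidates) can be illustrated with a minimal Python sketch. The function name `vtt_to_candidates` and the one-cue-per-candidate rule are assumptions made for illustration; the parsers actually used to build the corpus are not documented here and may differ.

```python
import re

# Matches a WebVTT cue timing line, e.g. "00:00:01.000 --> 00:00:04.000".
# The hours field is optional in WebVTT.
CUE_TIMING = re.compile(
    r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)


def vtt_to_candidates(vtt_text: str) -> list[str]:
    """Return the cue text of a WebVTT document, one string per cue.

    Hypothetical helper: timestamps, cue identifiers and the file header
    are dropped, and each cue becomes one utterance candidate.
    """
    # Drop a leading byte-order mark and normalise line endings.
    text = vtt_text.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    candidates = []
    # Blocks in a WebVTT file are separated by one or more blank lines.
    for block in re.split(r"\n{2,}", text):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        # Locate the timing line; blocks without one (header, NOTE, STYLE) are skipped.
        idx = next((i for i, ln in enumerate(lines) if CUE_TIMING.match(ln)), None)
        if idx is None:
            continue
        # Everything after the timing line is cue payload.
        payload = " ".join(lines[idx + 1:])
        # Remove inline markup such as <v Name> or <i>, then collapse whitespace.
        payload = re.sub(r"<[^>]+>", "", payload)
        payload = re.sub(r"\s+", " ", payload).strip()
        if payload:
            candidates.append(payload)
    return candidates


sample = (
    "WEBVTT\n\n"
    "1\n00:00:01.000 --> 00:00:04.000\nOtwieram sesję\nRady Miejskiej.\n\n"
    "00:00:04.500 --> 00:00:06.000\n<v Radny>Dziękuję.\n"
)
print(vtt_to_candidates(sample))
```

Treating each cue as one candidate keeps the sketch short; automatic captions often split a single turn across many cues, so a real preprocessing step would likely need to merge adjacent cues before speaker attribution.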
== Processing Pipeline ==

The automatic structuring pipeline consists of two main stages, both powered by large language models (LLMs).

=== Stage 1: Speaker Extraction ===

Potential speaker names are identified using a combination of rule-based name recognition and contextual inference performed by LLMs. The models are prompted to detect person names and administrative roles, e.g., Chairperson (PL ''Przewodniczący''), Mayor (PL ''Burmistrz''), Councilor (PL ''Radny''), ensuring both high recall and accurate disambiguation in cases of title repetition or partial name mentions.

=== Stage 2: Utterance Attribution ===

The LLMs then assign each utterance segment to one of the previously extracted speakers. This stage requires interpreting discourse cues such as addressing forms, transitions, and speaker introductions. The output is a fully structured transcript in which each utterance is associated with a speaker identifier (speaker name, role, and meeting session).

=== Processing Configuration ===

For the public release, both stages were executed end-to-end with !DeepSeek-chat-v3-0324. Transcripts longer than 1,500 lines (approximately 60,000 characters) were split into chunks, and the chunk-level results were merged by global line numbers.

=== Throughput and Cost ===

|| '''Metric''' || '''Value''' ||
|| Transcripts processed || 31,899 ||
|| Total input tokens || ~1,100,000,000 ||
|| Total output tokens || ~55,000,000 ||
|| Total processing time (days) || 16.82 ||
|| Total cost (USD) || 373.18 ||
|| Avg input tokens per transcript || 34,038.3 ||
|| Avg output tokens per transcript || 1,742.3 ||
|| Avg generation time (s) || 41.964 ||
|| Avg cost per transcript (USD) || 0.01078 ||

== Corpus Statistics ==

The LocalGovPL corpus represents a substantial collection of local government meeting transcripts, spanning nearly seven years of administrative proceedings (November 2018 to June 2025) across 749 councils.
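The per-council and per-session averages reported in this section follow directly from the corpus totals. A short Python sketch of the arithmetic is given here as a sanity check; the variable names are illustrative only.

```python
# Totals as reported for the corpus.
total_transcripts = 31_899
councils = 749
total_words = 362_664_794
total_chars = 2_468_439_776

# Derived averages, rounded the same way as the reported figures.
transcripts_per_council = round(total_transcripts / councils, 2)
words_per_session = round(total_words / total_transcripts)
chars_per_session = round(total_chars / total_transcripts)

print(transcripts_per_council)  # 42.59
print(words_per_session)        # 11369
print(chars_per_session)        # 77383
```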
|| '''Category''' || '''Count''' || '''Average per Session''' ||
|| '''Basic Statistics''' || || ||
|| Total transcripts || 31,899 || – ||
|| Date range || 2018-11 to 2025-06 || – ||
|| Number of councils || 749 || – ||
|| Transcripts per council || – || 42.59 ||
|| '''Duration Statistics''' || || ||
|| Average session duration || – || 2.23 hours ||
|| '''Content Statistics''' || || ||
|| Total words || 362,664,794 || 11,369 ||
|| Total characters || 2,468,439,776 || 77,383 ||
|| '''Speaker Statistics''' || || ||
|| Average speakers per session || – || 12.77 ||
|| Average utterances per session || – || 80.2 ||

== Corpus Format ==

Corpus files are made available in '''XML TEI P5''' format, following the same design choices as the [[https://clip.ipipan.waw.pl/PPC|Polish Parliamentary Corpus (PPC)]], ensuring interoperability with existing tools and facilitating cross-corpus comparisons. Each meeting transcription is represented by a pair of XML files:

=== Session Header (header.xml) ===

The `header.xml` file contains the TEI header with document-level metadata and the participant registry, including:

 * '''`title`''' – meeting title used as the document name (e.g., ''Sesja Rady 30 stycznia 2019'' / Council Session on January 30, 2019)
 * '''`publisher`''' – the organising body responsible for the session (e.g., ''Rada Miejska Nowego Miasta Lubawskiego'' / Municipal Council of Nowe Miasto Lubawskie)
 * '''`system`''' – source system label for provenance tracking (e.g., ''Sesja Rady Lokalnej'' / Local Council Session)
 * '''`house`''' – assembly or chamber type (e.g., ''Rada Powiatu'' / County Council)
 * '''`sitting ID`''' – numeric identifier of the sitting
 * '''`type`''' – content type of the source (e.g., ''Transkrypcja sesji'' / Session transcript)
 * '''`total rows`''' – number of input transcript rows prior to structuring
 * '''`speaker count`''' – number of distinct speakers recognised in the session
 * '''`date`''' – session date in ISO format (e.g., 2019-01-30)

Each
'''`person`''' in the participant list is uniquely identified and carries a normalised name and role:

 * `person[@xml:id]` provides a stable identifier (e.g., `chairman_of_municipal_council`)
 * `persName` holds the display name (e.g., ''Przewodniczący Rady Miejskiej'' / Chairman of the Municipal Council)
 * `@role` encodes the role (e.g., ''Burmistrz Gminy'' / Mayor of the Municipality)

=== Utterance Structure (text_structure.xml) ===

The `text_structure.xml` file contains the speech content segmented into `