Yiddish is one of the national minority languages of Sweden, and one of the languages for which the Swedish Institute for Language and Folklore is responsible for developing useful language resources. We here describe the web-based version of a Swedish–Yiddish/Yiddish–Swedish dictionary. The single search field of the web-based dictionary is used for incrementally searching all three components of the dictionary entries (the word in Swedish, the word in Yiddish with Hebrew characters and the transliteration in Latin script). When the user accesses the dictionary in an online mode, the dictionary is saved in the web browser, which makes it possible to also use the dictionary offline.
Tilltal (Tillgängligt kulturarv för forskning i tal, ‘Accessible cultural heritage for speech research’) is a multidisciplinary and methodological project undertaken by the Institute of Language and Folklore, KTH Royal Institute of Technology, and The Swedish National Archives in cooperation with the National Language Bank and SWE-CLARIN [1]. It aims to provide researchers better access to archival audio recordings using methods from language technology. The project comprises three case studies and one activity and usage study. In the case studies, actual research agendas from three different fields (ethnology, sociolinguistics, and interaction analysis) serve as a basis for identifying procedures that may be simplified with the aid of digital tools. In the activity and usage study, we are applying an activity-theoretical approach with the aim of involving researchers and investigating how they use – and would like to be able to use – the archival resources at ISOF. Involving researchers in participatory design ensures that digital solutions are suggested and evaluated in relation to the requirements expressed by researchers engaged in specific research tasks[2].In this paper, we focus on one of the case studies, which investigates the process by which personal experience narratives are transformed into cultural heritage [3], and account for our results in exploring how different types of text material from the archives can be used to find relevant sections of the audio recordings. Finally, we discuss what lessons can be learned, and what conclusions can be drawn, from our experiences of interdisciplinary collaboration in the project.
In this experiment, we find that features which model interaction andconversational behaviour contribute well to identifying sexual grooming behaviourin chat and forum text. Together with the obviously useful lexical features —which we find are more valuable if separated by who generates them — weachieve very successful results in identifying behavioural patterns which maycharacterise sexual grooming. We conjecture that the general framework can beused for other purposes than this specific case if the lexical features are exchangedfor other topical models, the conversational features characterise interaction andbehaviour rather than topical choice.
We here demonstrate how a set of tools that are being maintained and further developed within the Språkbanken Sam and SWE-CLARIN infrastructures can be employed for creating manually labelled training data in a low-resource setting. As example text, we used the “COVID-19 Open Research Dataset”, and created manually annotated training data for its associated Kaggle task,“What do we know about COVID-19 risk factors?”. We first used our topic modelling tool to i) select a text set for manual annotation, ii) classify the texts into preliminary classification categories, and iii) analyse the texts in search for potential refinements of the annotation categories. We then annotated the text set on a more granular level by labelling the token sequences that indicated the existence of the refined categories in the text. Finally, we used the granularly annotated text set as a seed set, and applied our active learning tool for actively selecting additional texts for annotation. For the token-sequence annotations, we used our text annotation tool, which includes support for incorporating automatic pre-annotations.
We here describe the creation of manually annotated training data for the Kaggle task “What do we know about COVID-19 risk factors?”. We applied our text mining tool on the “COVID-19 Open Research Dataset” to i) select data for manual annotation, ii) classify the data into initially established classification categories, and iii) analyse our data set in search for potential refinements of the annotation categories. The process resulted in a corpus consisting of 50,000 tokens, for which each token is annotated as to whether it is part of an expression that functions as a “risk factor trigger”. Two types of risk factor triggers were annotated, those indicating that the text describes a risk factor, and those indicating that something could not be shown to be a risk factor.
We here describe line-a-line, a web-based tool for manual annotation of word-alignments in sentence-aligned parallel corpora. The graphical user interface, which builds on a design template from the Jigsaw system for investigative analysis, displays the words from each sentence pair that is to be annotated as elements in two vertical lists. An alignment between two words is annotated by drag-and-drop, i.e. by dragging an element from the left-hand list and dropping it on an element in the right-hand list. The tool indicates that two words are aligned by lines that connect them and by highlighting associated words when the mouse is hovered over them. Line-a-line uses the efmaral library for producing pre-annotated alignments, on which the user can base the manual annotation. The tool is mainly planned to be used on moderately under-resourced languages, for which resources in the form of parallel corpora are scarce. The automatic word-alignment functionality therefore also incorporates information derived from non-parallel resources, in the form of pre-trained multilingual word embeddings from the MUSE library.
We here describe data from the SB Sam Language Bank, one of three divisions within the Swedish language technology and research infrastructure The National Language Bank of Sweden. The SB Sam Language Bank aims at making data gathered by the Institute for Language and Folklore more available, i.e. folklore and dialect archives, terms and dictionaries, as well as language data produced at other Swedish public agencies. The data gathered from public agencies in SB Sam consists of three main repositories, (i) translation memories, i.e. sentence-aligned texts in different languages that have either been extracted from translation tools or from automatically sentence-aligned parallel texts, (ii) terms gathered from public agencies, and (iii) parallel texts in several languages, that either have been crawled from public agency web sites or that were received directly from the agency.