Domeij, Rickard
Publications (10 of 28)
Ahltorp, M., Hessel, J., Eriksson, G., Skeppstedt, M. & Domeij, R. (2022). A Digital Swedish–Yiddish/Yiddish–Swedish Dictionary: A Web-Based Dictionary that is also Available Offline. In: Proceedings of the EURALI Workshop @LREC2022. Paper presented at LREC 2022.
2022 (English). In: Proceedings of the EURALI Workshop @LREC2022, 2022. Conference paper, Published paper (Refereed)
Abstract [en]

Yiddish is one of the national minority languages of Sweden, and one of the languages for which the Swedish Institute for Language and Folklore is responsible for developing useful language resources. We here describe the web-based version of a Swedish–Yiddish/Yiddish–Swedish dictionary. The single search field of the web-based dictionary is used for incrementally searching all three components of the dictionary entries (the word in Swedish, the word in Yiddish with Hebrew characters and the transliteration in Latin script). When the user accesses the dictionary in an online mode, the dictionary is saved in the web browser, which makes it possible to also use the dictionary offline.
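The single-field incremental search mentioned in the abstract can be illustrated with a minimal Python sketch. The `Entry` fields, the sample entries, and the prefix-matching rule are assumptions made for illustration only; the abstract does not state how the dictionary actually matches queries.

```python
from dataclasses import dataclass
import unicodedata


@dataclass(frozen=True)
class Entry:
    swedish: str
    yiddish: str          # written with Hebrew characters
    transliteration: str  # Latin script


def normalize(text: str) -> str:
    """Apply NFC and case-folding so equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text).casefold()


def incremental_search(entries, query):
    """Return entries where any of the three components starts with the query.

    Meant to be called again after every keystroke with the text typed so far.
    Prefix matching is an assumption, not the documented behaviour.
    """
    q = normalize(query)
    if not q:
        return []
    return [
        e for e in entries
        if any(normalize(field).startswith(q)
               for field in (e.swedish, e.yiddish, e.transliteration))
    ]


# Hypothetical sample data, not taken from the actual dictionary.
ENTRIES = [
    Entry("bok", "בוך", "bukh"),
    Entry("bord", "טיש", "tish"),
    Entry("hus", "הויז", "hoyz"),
]

if __name__ == "__main__":
    for typed in ("b", "bo", "bu", "ti"):
        print(typed, [e.swedish for e in incremental_search(ENTRIES, typed)])
```

Because one function checks all three components, the user does not have to choose a search direction or script before typing.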

National subject category
Specific Languages
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-2471 (URN)
Conference
LREC 2022
Funder
Vetenskapsrådet, 2017-00626
Available from: 2022-07-15. Created: 2022-07-15. Last updated: 2023-12-01. Bibliographically approved.
Skeppstedt, M., Mattson, M., Ahltorp, M. & Domeij, R. (2022). Converting from the Nordic Terminological Record Format to the TBX Format. In: Proceedings of the TERM21 Workshop, Language Resources and Evaluation Conference (LREC 2022). Paper presented at Language Resources and Evaluation Conference (LREC 2022).
2022 (English). In: Proceedings of the TERM21 Workshop, Language Resources and Evaluation Conference (LREC 2022), 2022. Conference paper, Published paper (Refereed)
Abstract [en]

Rikstermbanken (Sweden’s National Term Bank), which was launched in 2009, uses the Nordic Terminological Record Format (NTRF) for organising its terminological data. Since then, new terminology formats have been established as standards, e.g., the Termbase eXchange format (TBX). We here describe work carried out by the Institute for Language and Folklore within the Federated eTranslation TermBank Network Action. This network develops a technical infrastructure for facilitating sharing of terminology resources throughout Europe. To be able to share some of the term collections of Rikstermbanken within this network and export them to Eurotermbank, we have implemented a conversion from the Nordic Terminological Record Format, as used in Rikstermbanken, to the TBX format.

National subject category
Languages and Literature
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-2472 (URN)
Conference
Language Resources and Evaluation Conference (LREC 2022)
Available from: 2022-07-15. Created: 2022-07-15. Last updated: 2023-12-01. Bibliographically approved.
Skeppstedt, M., Domeij, R., Eriksson, G. & Öqvist, J. (2022). Digital humanities for the spreadsheet nerd: Presenting the output of a topic modelling tool as tabular data. In: DHNB 2022 Conference: Book of Abstracts. Paper presented at Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022).
2022 (English). In: DHNB 2022 Conference: Book of Abstracts, 2022. Conference paper, Oral presentation with published abstract (Refereed)
National subject category
Other Humanities not elsewhere specified
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-2470 (URN)
Conference
Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022)
Project
Tilltal
Available from: 2022-07-15. Created: 2022-07-15. Last updated: 2022-07-29. Bibliographically approved.
Skeppstedt, M., Ahltorp, M., Eriksson, G. & Domeij, R. (2021). A Pipeline for Manual Annotations of Risk Factor Mentions in the COVID-19 Open Research Dataset. In: Selected Papers from the CLARIN Annual Conference 2020. Paper presented at CLARIN Annual Conference 2020.
2021 (English). In: Selected Papers from the CLARIN Annual Conference 2020, 2021. Conference paper, Published paper (Refereed)
Abstract [en]

We here demonstrate how a set of tools that are being maintained and further developed within the Språkbanken Sam and SWE-CLARIN infrastructures can be employed for creating manually labelled training data in a low-resource setting. As example text, we used the “COVID-19 Open Research Dataset”, and created manually annotated training data for its associated Kaggle task, “What do we know about COVID-19 risk factors?”. We first used our topic modelling tool to i) select a text set for manual annotation, ii) classify the texts into preliminary classification categories, and iii) analyse the texts in search of potential refinements of the annotation categories. We then annotated the text set on a more granular level by labelling the token sequences that indicated the existence of the refined categories in the text. Finally, we used the granularly annotated text set as a seed set, and applied our active learning tool for actively selecting additional texts for annotation. For the token-sequence annotations, we used our text annotation tool, which includes support for incorporating automatic pre-annotations.

National subject category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-2075 (URN)
Conference
CLARIN Annual Conference 2020
Funder
Vetenskapsrådet, 2017-00626
Available from: 2021-10-21. Created: 2021-10-21. Last updated: 2023-12-01. Bibliographically approved.
Skeppstedt, M., Ahltorp, M., Domeij, R., Eriksson, G. & Öqvist, J. (2021). Mining for Recurring Themes in Speech Recording Descriptions. Paper presented at The 9th Swedish Workshop on Data Science.
2021 (English). Conference paper, Poster (with or without abstract) (Refereed)
National subject category
Language Technology (Computational Linguistics)
Research subject
Language technology
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-2217 (URN)
Conference
The 9th Swedish Workshop on Data Science
Project
Tilltal; Nationella språkbanken
Funder
Riksbankens Jubileumsfond, SAF16-0917:1
Available from: 2021-12-12. Created: 2021-12-12. Last updated: 2023-12-01. Bibliographically approved.
Skeppstedt, M., Domeij, R. & Skott, F. (2021). Snippets of Folk Legends: Adapting a Text Mining Tool to a Collection of Folk Legends. In: Post-Proceedings of the 5th Conference Digital Humanities in the Nordic Countries (DHN 2020). Paper presented at 5th Conference Digital Humanities in the Nordic Countries (DHN 2020).
2021 (English). In: Post-Proceedings of the 5th Conference Digital Humanities in the Nordic Countries (DHN 2020), 2021. Conference paper, Published paper (Refereed)
Abstract [en]

A topic modelling tool was adapted to requirements for a collection of Swedish folk legends. To offer an overview of a list of folk legend texts, which had been automatically extracted by the topic modelling tool, snippet text versions of the folk legends were displayed. The snippets were constructed from the full-text versions of the legends using the sentences most relevant to the topics extracted by the topic modelling algorithm. In addition, collection-adapted data was constructed for performing a pre-processing of the folk legend texts, before they were submitted to the topic modelling algorithm. This data consisted of a collection-adapted stop word list and word lists for improving the quality of clusters of semantically similar words.
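The idea of building a snippet from the sentences most relevant to a topic can be sketched as follows. This is only an illustration under stated assumptions: relevance is approximated here by counting occurrences of topic terms, and the function names, the naive sentence splitter, and the sample text are all hypothetical. The abstract does not specify the relevance measure the tool actually uses.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def make_snippet(text, topic_terms, max_sentences=2):
    """Keep the sentences with the most topic-term hits, in original order.

    Counting term hits is an assumed stand-in for the tool's relevance score.
    """
    terms = {t.lower() for t in topic_terms}
    sentences = split_sentences(text)
    scored = []
    for index, sentence in enumerate(sentences):
        tokens = re.findall(r"\w+", sentence.lower())
        score = sum(1 for tok in tokens if tok in terms)
        scored.append((score, index))
    # Highest score first; ties are broken by earlier position in the text.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    keep = sorted(index for score, index in best if score > 0)
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    # Hypothetical legend text and topic terms, for illustration only.
    legend = ("The farmer walked home. A troll stood on the bridge. "
              "The troll asked for silver. It rained all night.")
    print(make_snippet(legend, ["troll", "silver", "bridge"]))
```

Keeping the selected sentences in their original order preserves the narrative flow of the legend, even though the snippet omits the sentences that do not mention the topic.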

National subject category
Language Technology (Computational Linguistics)
Research subject
Language technology
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-2074 (URN)
Conference
5th Conference Digital Humanities in the Nordic Countries (DHN 2020)
Project
Nationella språkbanken
Funder
Vetenskapsrådet, 2017-00626
Available from: 2021-10-21. Created: 2021-10-21. Last updated: 2021-12-29. Bibliographically approved.
Skeppstedt, M., Domeij, R. & Skott, F. (2020). Adapting a Topic Modelling Tool to the Task of Finding Recurring Themes in Folk Legends. In: Reinsone et al. (Eds.), Proceedings of the Digital Humanities in the Nordic Countries 5th Conference (DHN 2020). Paper presented at Digital Humanities in the Nordic Countries 5th Conference (DHN 2020) (pp. 388-392).
2020 (English). In: Proceedings of the Digital Humanities in the Nordic Countries 5th Conference (DHN 2020) / [ed] Reinsone et al., 2020, pp. 388-392. Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

A topic modelling tool, which was originally developed for performing text analysis on very short texts written in English, was adapted to the text genre of Swedish folk legends. The topic modelling tool was configured to use a word space model trained on a Swedish corpus, as well as a Swedish stop word list. The stop word list consisted of standard Swedish stop words, as well as 380 additional stop words that were tailored to the content of the corpus and therefore also included older spelling versions and grammatical forms of Swedish words. The adapted version of the tool was applied to a corpus consisting of around 10,000 Swedish folk legends, which resulted in the automatic extraction of 20 topics. Future versions of the tool will be extended with text summarisation functionality, in order to retain the text overview provided by the tool also when it is applied to longer folk legends.

National subject category
Languages and Literature
Research subject
Language technology
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-1813 (URN)
Conference
Digital Humanities in the Nordic Countries 5th Conference (DHN 2020)
Available from: 2020-12-10. Created: 2020-12-10. Last updated: 2021-12-29. Bibliographically approved.
Skeppstedt, M., Ahltorp, M., Eriksson, G. & Domeij, R. (2020). Annotating risk factor mentions in the COVID-19 Open Research Dataset. In: Costanza Navarretta and Maria Eskevich (Eds.), Proceedings of CLARIN Annual Conference 2020. Paper presented at CLARIN Annual Conference (pp. 52-55).
2020 (English). In: Proceedings of CLARIN Annual Conference 2020 / [ed] Costanza Navarretta and Maria Eskevich, 2020, pp. 52-55. Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

We here describe the creation of manually annotated training data for the Kaggle task “What do we know about COVID-19 risk factors?”. We applied our text mining tool to the “COVID-19 Open Research Dataset” to i) select data for manual annotation, ii) classify the data into initially established classification categories, and iii) analyse our data set in search of potential refinements of the annotation categories. The process resulted in a corpus consisting of 50,000 tokens, in which each token is annotated as to whether it is part of an expression that functions as a “risk factor trigger”. Two types of risk factor triggers were annotated: those indicating that the text describes a risk factor, and those indicating that something could not be shown to be a risk factor.

National subject category
Languages and Literature
Research subject
Language technology
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-1817 (URN)
Conference
CLARIN Annual Conference
Available from: 2020-12-17. Created: 2020-12-17. Last updated: 2023-12-01. Bibliographically approved.
Domeij, R., Edlund, J., Eriksson, G., Fallgren, P., David, H., Lindström, E., ... Öqvist, J. (2020). Exploring the archives for textual entry points to speech: Experiences of interdisciplinary collaboration in making cultural heritage accessible for research. In: Steven Krauwer & Darja Fišer (Eds.), Proceedings of the Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020. Paper presented at DHN 2020 (Vol. 2717, pp. 45-55). Riga.
2020 (English). In: Proceedings of the Twin Talks 2 and 3 Workshops at DHN 2020 and DH 2020 / [ed] Steven Krauwer & Darja Fišer, Riga, 2020, Vol. 2717, pp. 45-55. Conference paper, Published paper (Other academic)
Abstract [en]

Tilltal (Tillgängligt kulturarv för forskning i tal, ‘Accessible cultural heritage for speech research’) is a multidisciplinary and methodological project undertaken by the Institute for Language and Folklore, KTH Royal Institute of Technology, and The Swedish National Archives in cooperation with the National Language Bank and SWE-CLARIN [1]. It aims to provide researchers with better access to archival audio recordings using methods from language technology. The project comprises three case studies and one activity and usage study. In the case studies, actual research agendas from three different fields (ethnology, sociolinguistics, and interaction analysis) serve as a basis for identifying procedures that may be simplified with the aid of digital tools. In the activity and usage study, we are applying an activity-theoretical approach with the aim of involving researchers and investigating how they use – and would like to be able to use – the archival resources at ISOF. Involving researchers in participatory design ensures that digital solutions are suggested and evaluated in relation to the requirements expressed by researchers engaged in specific research tasks [2]. In this paper, we focus on one of the case studies, which investigates the process by which personal experience narratives are transformed into cultural heritage [3], and account for our results in exploring how different types of text material from the archives can be used to find relevant sections of the audio recordings. Finally, we discuss what lessons can be learned, and what conclusions can be drawn, from our experiences of interdisciplinary collaboration in the project.

Place, publisher, year, edition, pages
Riga, 2020
Series
CEUR Workshop Proceedings, ISSN 1613-0073
National subject category
Humanities and the Arts; Engineering and Technology
Research subject
Language technology; Folkloristics; Dialectology
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-1816 (URN)
Conference
DHN 2020
Funder
Riksbankens Jubileumsfond, SAF16-0917:1
Available from: 2020-12-17. Created: 2020-12-17. Last updated: 2022-06-07. Bibliographically approved.
Skeppstedt, M., Ahltorp, M., Eriksson, G. & Domeij, R. (2020). Line-a-line: A Tool for Annotating Word-Alignment. In: Reinhard Rapp, Pierre Zweigenbaum and Serge Sharoff (Eds.), Proceedings of the 13th Workshop on Building and Using Comparable Corpora. Paper presented at 13th Workshop on Building and Using Comparable Corpora, LREC (pp. 1-5).
2020 (English). In: Proceedings of the 13th Workshop on Building and Using Comparable Corpora / [ed] Reinhard Rapp, Pierre Zweigenbaum and Serge Sharoff, 2020, pp. 1-5. Conference paper, Published paper (Refereed)
Abstract [en]

We here describe line-a-line, a web-based tool for manual annotation of word-alignments in sentence-aligned parallel corpora. The graphical user interface, which builds on a design template from the Jigsaw system for investigative analysis, displays the words from each sentence pair that is to be annotated as elements in two vertical lists. An alignment between two words is annotated by drag-and-drop, i.e. by dragging an element from the left-hand list and dropping it on an element in the right-hand list. The tool indicates that two words are aligned by lines that connect them and by highlighting associated words when the mouse is hovered over them. Line-a-line uses the efmaral library for producing pre-annotated alignments, on which the user can base the manual annotation. The tool is mainly planned to be used on moderately under-resourced languages, for which resources in the form of parallel corpora are scarce. The automatic word-alignment functionality therefore also incorporates information derived from non-parallel resources, in the form of pre-trained multilingual word embeddings from the MUSE library.

National subject category
Languages and Literature
Research subject
Language technology
Identifiers
urn:nbn:se:sprakochfolkminnen:diva-1812 (URN)
Conference
13th Workshop on Building and Using Comparable Corpora, LREC
Available from: 2020-12-10. Created: 2020-12-10. Last updated: 2023-12-01. Bibliographically approved.