Research results

Working papers

This paper addresses the challenge of evaluating page segmentation methods in the context of extracting historical job advertisements from digitized newspapers. Accurate segmentation is essential for high-quality Optical Character Recognition (OCR) results, yet the methodology for comparing and evaluating segmentation algorithms has received limited attention in Digital Humanities. The paper presents an evaluation framework developed within the JobAds Project, focusing on textual congruence between predicted and ground-truth regions. Such a framework supports evidence-based selection of a segmentation algorithm and offers insight into the quality of the segmented data, which in turn affects research outcomes. The paper examines three evaluation features: intersection area, text similarity based on Levenshtein distance, and text presence/absence in non-intersecting parts of the predicted region and its ground truth, revealing their effectiveness through logistic regression models. The method involves manual ground-truth creation, aiming for an automatic metric to quantify textual congruence. Results show that combining the text presence/absence feature with Hausdorff distance achieves the highest performance, reaching an F1 score of 0.957 on the testing subset. The study emphasizes the need for tailored evaluation metrics in Digital Humanities and highlights challenges posed by OCR errors and irregular layouts while underscoring the importance of transparency in research. The proposed evaluation framework offers insights for segmentation assessment in historical newspapers, with further application beyond the specific dataset and use case.
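As a rough illustration of two of the features named in the abstract above, the following Python sketch computes a Levenshtein-based text similarity and the intersection area of two regions. It is not the paper's implementation: the function names are hypothetical, the regions are assumed to be axis-aligned rectangles given as (x0, y0, x1, y1), and normalizing the edit distance by the longer string's length is an assumption made here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def text_similarity(predicted: str, ground_truth: str) -> float:
    """Similarity in [0, 1]; the normalization is an assumption of this sketch."""
    longest = max(len(predicted), len(ground_truth))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - levenshtein(predicted, ground_truth) / longest


def intersection_area(box_a, box_b) -> float:
    """Overlap area of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    width = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    height = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(0, width) * max(0, height)
```

For example, `text_similarity("Koch", "Kock")` gives 0.75 (one substitution over four characters), and two 10 by 10 boxes offset by 5 in each direction overlap in an area of 25. Values like these could then serve as inputs to a classifier such as the logistic regression models mentioned in the abstract.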

Over the last 200 years, the division of labor has increased drastically. Different skills and knowledge need to be combined for production. How is dispersed knowledge brought to the place where it creates particularly large value? To assess this matching, we study labor markets as the device that facilitates such processes in a decentralized manner. We start our investigation in the middle of the 19th century, which marked the beginning of the 'modern' labor market, and follow the market for 100 years. We use job ads in newspapers as our major data source. The analysis is put into the perspective of the emergence, development, and functioning of markets as a means to facilitate matching. The labor market was 'created' by the initiative of many actors, including, at a later stage, some public actors. The market changed over time without losing robustness or functionality. The changes increased the stability of matches and followed social preferences.

Working paper

Published papers

Historical job advertisements provide invaluable insights into the evolution of labor markets and societal dynamics. However, extracting structured information, such as job titles, from these OCRed and unstructured texts presents significant challenges. This study evaluates four distinct computational approaches for job title extraction: a dictionary-based method, a rule-based approach leveraging linguistic patterns, a Named Entity Recognition (NER) model fine-tuned on historical data, and a text generation model designed to rewrite advertisements into structured lists. Our analysis spans multiple versions of the ANNO dataset, including raw OCR, automatically postcorrected, and human-corrected text, as well as an external dataset of German historical job advertisements. Results demonstrate that the NER approach consistently outperforms other methods, showcasing robustness to OCR errors and variability in text quality. The text generation approach performs well on high-quality data but exhibits greater sensitivity to OCR-induced noise. While the rule-based method is less effective overall, it performs relatively well for ambiguous entities. The dictionary-based approach, though limited in precision, remains stable across datasets. This study highlights the impact of text quality on extraction performance and underscores the need for adaptable, generalizable methods. Future work should focus on integrating hybrid approaches, expanding annotated datasets, and improving OCR correction techniques to enhance the extraction of structured information from historical texts. These advancements will enable deeper exploration of labor market trends and contribute to the broader field of digital humanities.
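To make the simplest of the four approaches concrete, a dictionary-based extractor can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method evaluated in the paper: the lexicon `JOB_TITLES` and the function names are hypothetical, and matching here is exact, case-insensitive, whole-word lookup.

```python
import re

# Hypothetical mini-lexicon of German job titles, for illustration only.
JOB_TITLES = ["Köchin", "Kutscher", "Buchhalter", "Kellner"]


def build_pattern(titles):
    """Compile one whole-word, case-insensitive pattern for all titles."""
    # Longer titles first, so a long title wins over a shorter one it starts with.
    alternatives = sorted((re.escape(t) for t in titles), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def extract_job_titles(text, pattern):
    """Return every lexicon match in the order it appears in the text."""
    return [match.group(0) for match in pattern.finditer(text)]


pattern = build_pattern(JOB_TITLES)
clean_ad = "Tüchtige Köchin und ein Kutscher werden gesucht."
noisy_ad = "Tüchtige Kvchin wird gesucht."  # simulated OCR error

print(extract_job_titles(clean_ad, pattern))
print(extract_job_titles(noisy_ad, pattern))
```

In this sketch the clean advertisement yields both titles, while the simulated OCR error in the second one prevents any match. Exact lookup of this kind has no tolerance for misrecognized characters unless the lexicon or the matching is made fuzzy.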

Paper: jdmdh.episciences.org/15373