
arXiv.orgTransforming Sensitive Documents into Quantitative Data: An AI-Based Preprocessing Toolchain for Structured and Privacy-Conscious AnalysisUnstructured text from legal, medical, and administrative sources offers a rich but underutilized resource for research in public health and the social sciences. However, large-scale analysis is hampered by two key challenges: the presence of sensitive, personally identifiable information, and significant heterogeneity in structure and language. We present a modular toolchain that prepares such text data for embedding-based analysis, relying entirely on open-weight models that run on local hardware, requiring only a workstation-level GPU and supporting privacy-sensitive research.
The toolchain employs large language model (LLM) prompting to standardize, summarize, and, when needed, translate texts to English for greater comparability. Anonymization is achieved via LLM-based redaction, supplemented with named entity recognition and rule-based methods to minimize the risk of disclosure. We demonstrate the toolchain on a corpus of 10,842 Swedish court decisions under the Care of Abusers Act (LVM), comprising over 56,000 pages. Each document is processed into an anonymized, standardized summary and transformed into a document-level embedding. Validation, including manual review, automated scanning, and predictive evaluation shows the toolchain effectively removes identifying information while retaining semantic content. As an illustrative application, we train a predictive model using embedding vectors derived from a small set of manually labeled summaries, demonstrating the toolchain's capacity for semi-automated content analysis at scale.
By enabling structured, privacy-conscious analysis of sensitive documents, our toolchain opens new possibilities for large-scale research in domains where textual data was previously inaccessible due to privacy and heterogeneity constraints.