Data sources for text and data mining

Text and data mining (TDM) refers to algorithm-based processes that automatically extract information from unstructured or semi-structured text (text mining) and from structured data (data mining).
On this page you will find text and data mining resources – ordered by content category – which are either freely available on the web or available through UB Bern’s licenses.

If you are interested in obtaining data, please contact UB Bern unless other contact details are provided.

Documents from past events on TDM:

Text and Data Mining: A First View (2021, Slides in English)

Text- und Datamining in den Sozialwissenschaften (2022, Slides in German)

Licensed data, text and image collections

Resource	Contents	Detailed information
Swiss media content: Swissdox@LiRI (general information on the Swissdox database)	Bulk download of full texts from Swissdox (Swiss media database) as TSV or XML; approx. 23 million articles from 250 Swiss newspapers and (online) media products; dating from 1910, updated daily; access via SWITCH edu-ID; use also possible via API	Manual; title list; terms of use
Books international: HathiTrust Research Center	17 million digitized volumes from US libraries (dating from 1700 onwards); creation of own corpora and download in pre-processed form (derived datasets); simple text analysis routines and visualizations; virtual machines for data analysis; pre-processed datasets for English-language literature	Website, documentation; overview of services; authentication via SWITCH edu-ID and personal account with HathiTrust/HTRC
Journals: JSTOR Text Analysis Support (general information about the database)	12 million bibliographic records and full texts from licensed journals and open access eBooks; personal account required	Text analysis support (JSTOR); quick start guides; research requirements
WBIS Online (De Gruyter) (general information about the database)	Biographical records on over six million historical and contemporary persons; updated continuously; includes 8.5 million digital facsimile articles from biographical reference works; multilingual	WBIS
Germanistik Online (De Gruyter) (general information about the database)	400,000 bibliographical entries, updated continuously	Germanistik Online
Romance Studies Bibliography (De Gruyter) (general information about the database)	400,000 bibliographical entries, updated continuously	Romance Studies Bibliography
Cambridge Histories (CUP)	Over 400 volumes of international history (English); PDF (download), XML (on request); IP-based access (university network/VPN); general information about the database	Cambridge Histories
English-language periodicals (Gale Cengage)	The Times Digital Archive 1785-2014 (general information about the database); International Herald Tribune 1887-2013 (general information about the database); The Economist Historical Archive 1843-2015 (general information about the database)	Times Digital Archive; International Herald Tribune; Economist Historical Archive
English-language periodicals (ProQuest)	British Periodicals: 491 newspapers/magazines from the UK, Ireland, India, 1681-2007, 6.7 million articles, JPEG, PDF, OCR/XML (general information about the database); American Periodicals: 1,509 newspapers/magazines and scientific journals, North America, 1741-1988, 11.5 million articles, PDF, OCR/XML (general information about the database)	British Periodicals; American Periodicals
English-language monographs (Gale Cengage)	Eighteenth Century Collections Online (ECCO) (general information about the database); Nineteenth Century Collections Online (NCCO): British Theatre, Music and Literature (general information about the database); Nineteenth Century Collections Online (NCCO): Europe and Africa (general information about the database)	ECCO; NCCO British Theatre; NCCO Europe and Africa
UK Parliamentary Papers (ProQuest)	UK Parliamentary Papers from the 18th to the 20th century; XML, PDF; general information about the database	Parliamentary Papers
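
Several of the collections above deliver bulk downloads as tab-separated files (TSV). As an illustration only, the following minimal Python sketch reads tab-separated article rows and counts them per language. The column names (`id`, `pubdate`, `medium`, `language`, `content`) and the sample rows are assumptions made for this example and do not describe the actual export schema of Swissdox or any other provider; please consult the respective manual for the real field names.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; real exports will have different columns and content.
SAMPLE_TSV = (
    "id\tpubdate\tmedium\tlanguage\tcontent\n"
    "1\t2021-03-01\tExample Zeitung\tde\tErster Beispieltext.\n"
    "2\t2021-03-02\tExample Journal\tfr\tDeuxième texte d'exemple.\n"
    "3\t2021-03-02\tExample Zeitung\tde\tZweiter Beispieltext.\n"
)


def read_articles(handle):
    """Yield one dict per article row from a tab-separated file object."""
    # QUOTE_NONE keeps quotation marks inside article texts as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def count_by(articles, column):
    """Count how many articles share each value of the given column."""
    return Counter(row[column] for row in articles)


articles = list(read_articles(io.StringIO(SAMPLE_TSV)))
per_language = count_by(articles, "language")
print(per_language)
```

For a downloaded file, the same function could be called with `open(path, encoding="utf-8", newline="")` instead of the in-memory sample, assuming the export is UTF-8 encoded.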

Freely accessible data, text and image collections

Platform	Contents	Detailed information
e-rara	100,000 historic and rare printed publications from Swiss institutions; full texts: PDF, some TXT; Jupyter Notebook for bulk downloads of metadata and full texts	Overview of data interfaces and terms
e-manuscripta	150,000 manuscript materials from Swiss institutions; full texts: PDF; Jupyter Notebook for bulk downloads of metadata and full texts	Overview of data interfaces and terms
e-periodica	900 journals from Switzerland; full texts: PDF; Jupyter Notebook for bulk downloads of metadata and full texts including text parsing	Overview of data interfaces and terms
swisscollections	Metacatalog for 14 institutions; manuscripts, archival collections, old prints, musical materials, image collections, maps, and bibliographies; data export: overview list (CSV), metadata package (ZIP), and SRU interface	Terms of use
Chronicling America	18 million pages of 3,444 newspaper titles from the USA, 1777-1963; bulk download of scans and full texts (ALTO, JP2/JPEG, PDF, TXT); bulk downloads of images via Newspaper Navigator Dataset	Freely accessible, public domain
CLARIN Resource Families (website)	Overview and, in some cases, access to language corpora in all subject areas and many languages	Partly available for free, various licenses
Deutsches Textarchiv	A comprehensive collection of key texts from the 17th to the 19th century; DTA Core Corpus (1,500 titles); DTA Expansion Corpus (approximately 4,000 additional sources); dumps of various subcorpora by time period and genre; metadata (Dublin Core), full texts (TEI, TCF, TXT)	Freely accessible, CC-BY-SA
GLAM Workbench (website)	Comprehensive datasets from Australian and New Zealand heritage institutions, web archives and government documents; API documentation, bulk downloads and Jupyter notebooks	Freely accessible, various licenses
Internet Archive (documentation)	37 million books and texts in a variety of genres, languages and data formats; bulk download with command line tool and Python wrapper	Freely accessible, various licenses, sometimes not specified
OpenGLAM Survey (overview)	Overview of open data sources (digital reproductions, texts, metadata) of 1,600 cultural heritage institutions worldwide, with details of licenses and APIs	Freely accessible, public domain or open licenses
Project Gutenberg (documentation)	70,000 books in a variety of genres, languages and data formats	Freely accessible, public domain
Text Creation Partnership	73,000 public domain transcribed full texts (SGML/XML/TEI) of prints from the 15th to the 18th century as bulk downloads (single files also in the Oxford Text Archive: EPUB, HTML, XML, partly also POS-annotated as TSV); Early English Books Online (EEBO, 60,000 transcribed full texts, 1473-1700); Eighteenth-Century Collections Online (ECCO, 3,000 transcribed full texts, 1700-1800); Evans Early American Imprints (Evans, 5,000 transcribed full texts, 1640-1800)	Freely accessible, public domain
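
Some of the collections above, such as the Deutsches Textarchiv and the Text Creation Partnership, provide full texts in TEI XML. The following minimal Python sketch shows one possible way to pull the title and the running text of paragraphs out of a TEI document using only the standard library. The sample document is invented for this example, and the assumption that the text sits in `<p>` elements below `<text>` is a simplification: real TEI files are usually far more richly structured.

```python
import xml.etree.ElementTree as ET

# Namespace of TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature TEI document, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Beispieltitel</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Erster <hi>Absatz</hi> des Textes.</p>
      <p>Zweiter Absatz.</p>
    </body>
  </text>
</TEI>"""


def extract_title_and_paragraphs(tei_xml):
    """Return the header title and a list of plain-text paragraphs."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    paragraphs = []
    for p in root.iterfind(".//tei:text//tei:p", TEI_NS):
        # itertext() also collects text inside nested markup such as <hi>;
        # split/join normalizes line breaks and repeated spaces.
        paragraphs.append(" ".join("".join(p.itertext()).split()))
    return title, paragraphs


title, paragraphs = extract_title_and_paragraphs(SAMPLE_TEI)
print(title)
print(paragraphs)
```

For files on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`. Plain-text (TXT) downloads, where offered, avoid this step entirely.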

Legal aspects

The resources and their interfaces are subject to various legal and technical terms of use. Please consult these before any automated access. In particular, automated access is often prohibited for licensed content that is not listed here, and attempting it may cause the provider to block access to the database. If you are in any doubt, please contact us to check whether access is permitted.
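
As a purely technical complement to reading the terms of use, a script can check a site's robots.txt rules before requesting pages. The following minimal Python sketch uses the standard library for this. The robots.txt content, the user agent name and the URLs are invented for the example. Note that robots.txt is only a technical convention: it does not replace the legal terms of use or license conditions of a resource.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content and user agent, for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""
USER_AGENT = "example-tdm-bot"


def build_policy(robots_text):
    """Parse robots.txt text into a RobotFileParser object."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that the robots.txt rules allow for this agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def polite_delay(parser, user_agent=USER_AGENT, default=1.0):
    """Return the requested crawl delay in seconds, or a default if none is set."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


policy = build_policy(SAMPLE_ROBOTS)
candidates = [
    "https://example.org/public/item1",
    "https://example.org/private/item2",
]
print(allowed_urls(policy, candidates))
print(polite_delay(policy))
```

In a real script, the rules would be loaded from the provider's site with `set_url` and `read`, and the script would pause for the returned delay (for example with `time.sleep`) between requests.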

The Swiss Federal Act on Copyright and Related Rights permits the reproduction and storage of lawfully accessible content for scientific purposes, which includes TDM.

The use of e-media or parts thereof in combination with artificial intelligence (AI) technologies is in many cases contractually prohibited. If you are planning to use AI in this way, you must contact us in advance to clarify the applicable conditions.

For any questions or clarifications, please reach out to us.

University Library of Bern UB
