Data sources for text and data mining

Text and data mining (TDM) refers to algorithm-based processes that automatically extract information from unstructured or semi-structured text data (text mining) and from structured data (data mining).
On this page you will find text and data mining resources, ordered by content category, that are available either freely on the web or through UB Bern’s licenses.

Unless other contact details are provided, please contact UB Bern if you are interested in obtaining data.

Documents from past events on TDM: 

Licensed data, text and image collections
Resource | Contents | Detailed information
Swiss Media content:  

Swissdox@LiRI (general information on the Swissdox database)

  • Bulk download of full texts from Swissdox (Swiss media database): TSV, XML 
  • Approx. 23 million articles from 250 Swiss newspapers and (online) media products 
  • Dating from 1910, updated daily 
  • Access via SWITCH edu-ID 
  • Also accessible via API
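
As a rough illustration of working with a TSV bulk export like the one described above, the following sketch reads tab-separated article records with Python's standard library and counts articles per publication. The column names (`id`, `pubtime`, `medium_name`, `head`, `content`) and the sample rows are assumptions made for illustration only; check the header row of your actual export before relying on them.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a TSV bulk export; the real column
# names and contents may differ.
SAMPLE_TSV = (
    "id\tpubtime\tmedium_name\thead\tcontent\n"
    "1\t2021-03-01\tExample Daily\tFirst headline\tSome article text.\n"
    "2\t2021-03-02\tExample Daily\tSecond headline\tMore article text.\n"
    "3\t2021-03-02\tSample Weekly\tThird headline\tOther text.\n"
)


def count_articles_per_medium(tsv_file, medium_column="medium_name"):
    """Count rows per publication in a tab-separated file object."""
    # QUOTE_NONE keeps quotation marks inside article text as plain characters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[medium_column] for row in reader)


counts = count_articles_per_medium(io.StringIO(SAMPLE_TSV))
for medium, n in counts.most_common():
    print(medium, n)
```

For a real export, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`; the encoding is likewise an assumption.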

WBIS Online (DeGruyter) (general information about the database)
  • Biographical records on over six million historic and contemporary persons
  • Updated continuously. Includes 8.5 million digital facsimile articles from biographical reference works.
  • Multilingual  
Germanistik Online (DeGruyter) (general information about the database)
  • 400,000 bibliographical entries, updated continuously  
Romance Studies Bibliography (DeGruyter) (general information about the database)
  • 400,000 bibliographical entries, updated continuously  
Books International: HathiTrust Research Center 
  • 17 million digitized volumes from US libraries (from 1700) 
  • Build your own corpus and download it in preprocessed form (Derived Datasets)
  • Simple built-in text analysis routines and visualizations
  • Virtual machines for data analysis
  • Preprocessed datasets for English-language literature
Cambridge Histories (CUP) 
  • Over 400 volumes of international history (English) 
  • PDF (download), XML (on request) 
  • IP-driven access (University network/VPN) 
  • General information about the database 
English-language periodicals (Gale Cengage) 
  • The Times Digital Archive 1785-2014, general information about the database
  • International Herald Tribune 1887-2013, general information about the database 
  • The Economist Historical Archive 1843-2015, general information about the database 
English-language periodicals (ProQuest) 
  • British Periodicals: 491 newspapers/magazines from the UK, Ireland, India, 1681-2007, 6.7 million articles, JPEG, PDF, OCR/XML, general information about the database 
  • American Periodicals: 1,509 newspapers/magazines and scientific journals, North America, 1741-1988, 11.5 million articles, PDF, OCR/XML, general information about the database 
English-language monographs (Gale Cengage) 
  • Eighteenth Century Collections Online (ECCO), general information about the database 
  • Nineteenth Century Collections Online (NCCO): British Theatre, Music and Literature, general information about the database 
  • Nineteenth Century Collections Online (NCCO): Europe and Africa, general information about the database 
UK Parliamentary Papers (ProQuest) 
  • UK Parliamentary Papers from the 18th-20th century 
  • XML, PDF 
  • General information about the database 
Freely accessible data, text and image collections
Platform | Contents | Detailed information
e-rara  
  • 100,000 historic and rare printed publications from Swiss institutions 
  • Full texts: PDF, some TXT 
  • Jupyter Notebook for bulk downloads of metadata and full texts 
Overview of data interfaces and terms 
e-manuscripta  
  • 150,000 manuscript materials from Swiss institutions
  • Full texts: PDF 
  • Jupyter Notebook for bulk downloads of metadata and full texts 
Overview of data interfaces and terms 
e-periodica 
  • 900 journals from Switzerland 
  • Full texts: PDF 
  • Jupyter Notebook for bulk downloads of metadata and full texts including text parsing 
Overview of data interfaces and terms 
Chronicling America
Freely accessible, public domain
CLARIN Resource Families 
Website 
  • Overview and, in some cases, access to language corpora in all subject areas and many languages 
Partly available for free, various licenses 
Deutsches Textarchiv 
  • A comprehensive collection of key texts from the 17th to the 19th century. 
  • DTA Core Corpus (1500 titles) 
  • DTA Expansion Corpus (approximately 4000 additional sources) 
  • Dumps of various subcorpora by time period and genre 
  • Metadata (Dublin Core), full texts (TEI, TCF, TXT) 
Freely accessible, CC-BY-SA 
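
Full texts delivered as TEI, as listed above, are XML documents that can be processed with standard tools. The following is a minimal sketch, using only Python's standard library, of how the title and paragraph text might be pulled out of a TEI file. The sample document is invented for illustration, and real files are far richer (notes, page breaks, line breaks, front matter), so the paths used here are assumptions that may need adjusting.

```python
import xml.etree.ElementTree as ET

# TEI P5 documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature TEI document, used only to make the sketch runnable.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example title</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph with <hi>highlighted</hi> words.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>"""


def extract_title_and_paragraphs(tei_xml):
    """Return the first title and a list of whitespace-normalized paragraphs."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    paragraphs = [
        # itertext() also collects text inside nested elements such as <hi>.
        " ".join("".join(p.itertext()).split())
        for p in root.iterfind(".//tei:body//tei:p", TEI_NS)
    ]
    return title, paragraphs


title, paragraphs = extract_title_and_paragraphs(SAMPLE_TEI)
print(title)
for paragraph in paragraphs:
    print(paragraph)
```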
GLAM Workbench 
Website 
  • Comprehensive datasets from Australian and New Zealand heritage institutions, web archives and government documents. 
  • API documentation, bulk downloads and Jupyter notebooks 
Freely accessible, various licenses 
Internet Archive 
Documentation 
  • 37 million books and texts in a variety of genres, languages and data formats 
  • Bulk download with command line tool and Python wrapper 
Freely accessible, various licenses, sometimes not specified 
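
The command line tool and Python wrapper mentioned above are the usual route for bulk downloads. As a dependency-free alternative, the sketch below only builds request URLs for the Internet Archive's advanced search endpoint and for item files, using the standard library; it does not send any requests. The query string, the item identifier and the file name are hypothetical, and the endpoint parameters should be checked against the current Internet Archive documentation before use.

```python
from urllib.parse import quote, urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"
DOWNLOAD_BASE = "https://archive.org/download"


def build_search_url(query, rows=50, page=1):
    """Build a search URL that asks for item identifiers as JSON."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # field to return for each hit
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def build_download_url(identifier, filename):
    """Build the URL of a single file within an item."""
    return f"{DOWNLOAD_BASE}/{quote(identifier)}/{quote(filename)}"


# Hypothetical query and item, for illustration only.
search_url = build_search_url("mediatype:texts AND language:ger", rows=10)
file_url = build_download_url("example-item", "example item.txt")
print(search_url)
print(file_url)
```

Because licenses vary and are sometimes not specified, the rights statement of each item still needs to be checked individually before reuse.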
OpenGLAM Survey
Overview 
  • Overview of open data sources (digital reproductions, texts, metadata) of 1,600 cultural heritage institutions worldwide, with details of licenses and APIs 
Freely accessible, public domain or open licenses 
Project Gutenberg 
Documentation 
  • 70,000 books in a variety of genres, languages and data formats 
Freely accessible, public domain 
Text Creation Partnership 
  • 73,000 public domain transcribed full texts (SGML/XML/TEI) of printed works from the 15th to the 18th century as bulk downloads (single files also in the Oxford Text Archive: EPUB, HTML, XML, partly also POS-annotated as TSV) 
  • Early English Books Online (EEBO, 60,000 transcribed full texts, 1473-1700) 
  • Eighteenth-Century Collections Online (ECCO, 3,000 transcribed full texts, 1700-1800) 
  • Evans Early American Imprints (Evans, 5,000 transcribed full texts, 1640-1800) 
Freely accessible, public domain 

The resources and their interfaces are subject to various legal and technical terms of use. Please consult these before any automated access. In particular, automated access is often prohibited for licensed content that is not listed here, and it may lead the provider to block access to the database. If you are in any doubt, please contact us to check whether access is permitted.
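
One technical signal that can be checked before automated access is a site's `robots.txt` file. The sketch below uses Python's standard `urllib.robotparser` on an invented sample file; the user agent name, URLs and rules are hypothetical. Note that `robots.txt` is only a technical hint: it does not replace reading the terms of use and license conditions, which remain decisive.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, used only to make the sketch runnable offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-tdm-bot"  # hypothetical name for your own script


def make_parser(robots_txt):
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(SAMPLE_ROBOTS_TXT)
allowed_public = parser.can_fetch(USER_AGENT, "https://example.org/texts/1.xml")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.org/private/1.xml")
delay_seconds = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay_seconds)
```

In a real script, the requested delay (where present) would be respected between requests, for example with `time.sleep(delay_seconds)`.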

According to the Swiss Federal Act on Copyright and Related Rights, the duplication and storage of legally accessible content for scientific purposes, such as TDM, is permitted.

The use of e-media or parts thereof in combination with artificial intelligence (AI) technologies is in many cases contractually prohibited. If you are planning to use AI in this way, you must contact us in advance to clarify the relevant framework conditions. 

For any questions or clarifications, please reach out to us.