Tools

A full range of tools for data-driven research is available. In addition to many free and open-source services, there are also proprietary platforms. UB Bern develops its own tools as necessary, and offers licenses and guidance concerning text and data mining platforms. 

UB Bern’s DS Digital Toolbox offers Jupyter Notebooks for an easy introduction to the typical tasks involved in data work, including: 

  • Using the APIs of publishers, databases and data aggregators 
  • Cleaning up spreadsheet data 
  • Extracting text from PDFs and using text recognition (OCR) 
  • Segmenting documents in preparation for OCR 
  • Natural Language Processing (NLP) 
  • Querying and evaluating library metadata using SRU 

We also offer Notebooks for querying the metadata and full texts of the national platforms e-rara, e-manuscripta and e-periodica, which host material from Swiss cultural heritage institutions.

Constellate is the text analysis platform of the provider Ithaka. Its text collection includes the archives of JSTOR and Chronicling America. Users can assemble large corpora and download them as metadata, full texts and n-grams. Constellate also offers a series of tutorials introducing Python and Natural Language Processing (NLP), which are available as Jupyter Notebooks.

To use Constellate, you must access it from the University of Bern’s network or via VPN and set up a personal account.
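
As an illustration of the n-grams mentioned above, the following minimal sketch counts word n-grams in a short text using only the Python standard library. The function name `ngram_counts` and the simple tokenization (lowercasing, splitting on non-word characters) are assumptions made for this example; they do not reproduce Constellate's own processing or download format.

```python
import re
from collections import Counter


def ngram_counts(text, n=2):
    """Count word n-grams in a text (illustrative helper, not a Constellate API)."""
    # Naive tokenization: lowercase the text and keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    # Slide a window of length n over the token list.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


if __name__ == "__main__":
    sample = "The library offers tools. The library offers guidance."
    for gram, count in ngram_counts(sample, n=2).most_common(3):
        print(gram, count)
```

Real corpora usually call for more careful tokenization (hyphenation, OCR errors, language-specific rules), which the NLP tutorials cover in more depth.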

The HathiTrust Research Center (HTRC) enables text and data mining (TDM) methods to be applied to the contents of the HathiTrust Digital Library, which comprises over 17 million digitized volumes dating back to 1700. Users can create corpora according to their own criteria and process them with the text analysis routines provided or with their own algorithms. Various tools and comprehensive documentation are available for this purpose.

To use the HathiTrust Research Center (HTRC), you must be authenticated by SWITCH edu-ID and set up a personal account on HathiTrust/HTRC. 

 

OpenRefine is open-source software with an intuitive user interface for easy manipulation of spreadsheet data. OpenRefine provides many data cleaning and transformation functions which, thanks to the processing history, are easy to document and reproduce. One notable feature is the “Reconciliation” function, which allows users to check and enrich their own data against external data providers (e.g. Wikidata, the Integrated Authority File (GND), CrossRef).

OpenRefine is available for multiple operating systems and can also be tried out online without installation.

See also the slides (in German) from the workshop on Introduction to OpenRefine (2021/22). 

Jupyter is an open-source integrated development environment (IDE) for various data science programming languages. Jupyter takes a literate programming approach by combining code and documentation in one document (Jupyter Notebook). This allows analysis steps to be explained in detail, visualizations to be integrated directly into the file, and content to be exported to a variety of formats. 

Jupyter can be tried out online with a variety of kernels. EPFL provides an online JupyterHub environment for associated Swiss universities and research institutes.

Search/Retrieve via URL (SRU) is a protocol for internet searches that uses the Contextual Query Language (CQL). It allows you to run catalog queries directly in a browser via a URL (i.e., without going through the swisscovery search interface). A Jupyter Notebook is available for extracting the desired control fields and subfields from the MARCXML returned.
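
The two steps described above, building an SRU query URL and pulling fields out of MARCXML, can be sketched as follows. This is a minimal illustration, not the code of the Notebook mentioned above: the base URL `https://example.org/sru`, the CQL index name `title`, the `recordSchema` value and the sample record are placeholders, and the values a given catalog accepts must be taken from that server's documentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of the MARC 21 XML schema (MARCXML).
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, used here so that the sketch runs without network access.
SAMPLE_MARCXML = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">990000000001</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Example title</subfield>
    </datafield>
  </record>
</collection>"""


def build_sru_url(base_url, cql_query, max_records=10):
    """Build a searchRetrieve URL from standard SRU 1.2 parameters."""
    params = {
        "version": "1.2",
        "operation": "searchRetrieve",
        "recordSchema": "marcxml",  # assumption: schema name differs between servers
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


def extract_fields(marcxml, control_tag="001", data_tag="245", subfield_code="a"):
    """Return (control field, subfield) text pairs for each MARCXML record."""
    root = ET.fromstring(marcxml)
    results = []
    for record in root.iter("{http://www.loc.gov/MARC21/slim}record"):
        control = record.find(f"marc:controlfield[@tag='{control_tag}']", MARC_NS)
        subfield = record.find(
            f"marc:datafield[@tag='{data_tag}']/marc:subfield[@code='{subfield_code}']",
            MARC_NS,
        )
        results.append((
            control.text if control is not None else None,
            subfield.text if subfield is not None else None,
        ))
    return results


if __name__ == "__main__":
    # Placeholder endpoint and index name; replace with the values of a real SRU server.
    print(build_sru_url("https://example.org/sru", 'title = "bern"', max_records=5))
    print(extract_fields(SAMPLE_MARCXML))
```

Against a real server, the response to the generated URL would be fetched first. Because `extract_fields` searches the whole document for `record` elements, it should also work when the records are wrapped in an SRU response envelope.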

Digital Scholarship Tool Collections
Text analysis, Natural Language Processing (NLP), literature analysis 
Digital Humanities