Tools
A full range of tools for data-driven research is available. In addition to many free and open-source services, there are also proprietary platforms. UB Bern develops its own tools as necessary, and offers licenses and guidance concerning text and data mining platforms.
DS Digital Toolbox
The DS Digital Toolbox of the University Library Bern (UB Bern) offers Jupyter Notebooks as an easy introduction to typical tasks when working with data:
- Using the APIs of catalogues, full-text platforms, publishers, databases and data aggregators: swisscovery, e-rara, e-manuscripta, e-periodica, Crossref, OpenAlex, Swissdox@LiRI
- Cleaning spreadsheet data
- Segmenting documents in preparation for OCR
- Extracting text from PDFs and using text recognition (OCR)
- Natural Language Processing (NLP) basics
- Querying and analysing library metadata using SRU
The Notebooks for the national platforms e-rara, e-manuscripta and e-periodica query the metadata and full texts held on these Swiss cultural heritage platforms.
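As a rough illustration of the first task in the list, the sketch below queries the public Crossref REST API and reduces the response to DOI and title pairs. It is not taken from the Toolbox notebooks, which may work differently. The function names and the sample response are invented for this example; the sample only mirrors the general shape of a Crossref works response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Crossref endpoint for searching works.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_url(search_terms, rows=5):
    """Build a search URL; `query` and `rows` are Crossref query parameters."""
    return CROSSREF_WORKS + "?" + urlencode({"query": search_terms, "rows": rows})


def summarize_items(response):
    """Reduce a decoded Crossref response to (DOI, first title) pairs."""
    items = response.get("message", {}).get("items", [])
    # Crossref returns titles as a list; some records have no title at all.
    return [(item.get("DOI"), (item.get("title") or [None])[0]) for item in items]


def search_crossref(search_terms, rows=5):
    """Send the request (needs network access) and summarise the result."""
    with urlopen(build_crossref_url(search_terms, rows), timeout=30) as resp:
        return summarize_items(json.load(resp))


# Invented sample data with a fictitious DOI, for offline experimentation.
SAMPLE_RESPONSE = {
    "status": "ok",
    "message": {
        "items": [
            {"DOI": "10.0000/example.1", "title": ["An invented article title"]},
            {"DOI": "10.0000/example.2"},
        ]
    },
}

if __name__ == "__main__":
    print(build_crossref_url("text mining", rows=2))
    print(summarize_items(SAMPLE_RESPONSE))
```

Separating URL construction and response parsing from the network call keeps both steps easy to check in a Notebook cell without sending a request.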
Constellate
Constellate is the text analysis platform of the provider Ithaka. Its text collection includes the archives of JSTOR (scholarly journals) and Chronicling America (newspapers). Users can compile large corpora and download them as metadata, full texts and n-grams. Constellate also offers a series of tutorials introducing Python and Natural Language Processing (NLP), which are available as Jupyter Notebooks.
To use Constellate, you need to access it from the University of Bern's network or VPN and set up a personal account.
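For readers unfamiliar with n-grams, the following minimal sketch counts word n-grams in a short text. It only illustrates the concept: the tokenisation rule (lowercase runs of letters and digits) is an assumption made for this example, and Constellate's own preprocessing and download format are not described here and may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(text, n):
    """Count every run of n consecutive tokens in the text."""
    tokens = tokenize(text)
    # Zipping n shifted copies of the token list yields each n-token window.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


if __name__ == "__main__":
    sample = "The cat and the cat"
    print(ngram_counts(sample, 1).most_common())
    print(ngram_counts(sample, 2).most_common())
```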
HathiTrust Research Center (HTRC)
The HTRC makes it possible to apply text and data mining (TDM) methods to the contents of the HathiTrust Digital Library, which contains over 18 million digitised volumes dating from 1700 onwards. Users can create corpora according to their own criteria and process them with the text analysis routines provided, or run their own algorithms on them. Various tools and comprehensive documentation are available for this purpose.
To use the HathiTrust Research Center (HTRC), you must authenticate via SWITCH edu-ID and set up a personal account with HathiTrust/HTRC.
OpenRefine
OpenRefine is open-source software with an intuitive user interface for manipulating tabular data. It provides extensive functions for data cleaning and transformation which, thanks to the processing history, are easy to document and reproduce. One notable feature is the "Reconciliation" function, which can be used to check and enrich your own data against external data providers (e.g. Wikidata, Gemeinsame Normdatei (GND, the Integrated Authority File), FactGrid, ORCID, Getty, Crossref).
OpenRefine is available for several operating systems and can be tried out online without installation.
See also the slides (in German) from the workshop "Introduction to OpenRefine" (2021/22).
Jupyter
Jupyter is an open-source integrated development environment (IDE) for various programming languages used in data science, such as R and Python. Jupyter follows the literate programming approach, in which code, documentation and output are combined in one document (a Jupyter Notebook). Analysis steps can thus be explained in detail, visualisations can be integrated directly and the content can be exported in various formats.
Jupyter can be used locally, online via JupyterLite, or with a Google account in Google Colab. For members of Swiss universities and research institutions, EPFL provides an online JupyterHub environment (login via SWITCH edu-ID).
SRU
Search/Retrieve via URL (SRU) is a protocol for search queries on the Internet that uses the Contextual Query Language (CQL). It makes it possible to query a catalogue directly in a browser via a URL (e.g. without using the swisscovery search interface). A Jupyter Notebook is available for extracting the desired control fields and subfields from the MARCXML that is returned.
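The sketch below shows, under stated assumptions, how such a query URL can be assembled and how control fields and subfields can be read from MARCXML. It is not the Notebook mentioned above. The base URL is a placeholder, the CQL index name `title` is an assumption (the available indexes depend on the server), and the sample record is invented. The request parameters follow SRU version 1.2.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

MARC_URI = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace
MARC_NS = {"marc": MARC_URI}

# Invented record for illustration only; not taken from a real catalogue.
SAMPLE_MARCXML = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 c 4500</leader>
    <controlfield tag="001">990000000001</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An invented title</subfield>
      <subfield code="c">A. N. Author</subfield>
    </datafield>
  </record>
</collection>"""


def build_sru_url(base_url, cql_query, max_records=10):
    """Assemble a searchRetrieve URL that can be pasted into a browser."""
    params = {
        "version": "1.2",
        "operation": "searchRetrieve",
        "query": cql_query,
        "recordSchema": "marcxml",
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


def extract_fields(marcxml, control_tag, data_tag, subfield_code):
    """Return (control field value, list of subfield values) per MARC record."""
    root = ET.fromstring(marcxml)
    rows = []
    # iter() also finds MARC records nested inside an SRU response envelope.
    for record in root.iter("{%s}record" % MARC_URI):
        control = record.find(
            "marc:controlfield[@tag='%s']" % control_tag, MARC_NS
        )
        subfields = record.findall(
            "marc:datafield[@tag='%s']/marc:subfield[@code='%s']"
            % (data_tag, subfield_code),
            MARC_NS,
        )
        rows.append((
            control.text if control is not None else None,
            [sf.text for sf in subfields],
        ))
    return rows


if __name__ == "__main__":
    # Placeholder endpoint; replace with the SRU base URL of the catalogue.
    print(build_sru_url("https://example.org/sru", 'title = "Bern"'))
    print(extract_fields(SAMPLE_MARCXML, "001", "245", "a"))
```

Here control field 001 holds the record identifier and field 245, subfield a, holds the title, so each row pairs an identifier with its title strings.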
Digital Scholarship Tool Collections
- Text analysis, Natural Language Processing (NLP), literature analysis
- Digital Humanities