Text Preparation & Data Cleanup

The list of tools below is organized alphabetically, and it represents a selection of the resources available to Digital Humanists. Many of these tools are actively updated, so please contact the DH@Bucknell Web Team if you find any outdated information or if you would like to suggest additional tools or software.

Bucknell University holds site licenses for a number of these tools and provides faculty, staff, and students with access to and support for them; these tools have “BU access” listed under Pricing.


DataWrangler

DataWrangler is an interactive tool for data cleaning and transformation. Wrangler allows interactive transformation of messy, real-world data into the data tables analysis tools expect. Export data for use in Excel, R, Tableau, Protovis, and more.

Details

Website: http://vis.stanford.edu/wrangler/
Open Source Software (OSS) or Proprietary? OSS
Pricing: Free
Additional Information: The DataWrangler software is no longer actively supported. The group that created DataWrangler has since launched a commercial venture, Trifacta, which offers Trifacta Wrangler (actively supported).


Documenting the Now (DocNow)

Documenting the Now, or DocNow, develops tools and builds community practices that support the ethical collection, use, and preservation of social media content. DocNow's tools include twarc, a command-line tool and Python library that assists with the collection and cleanup of Twitter data.

Details

Website: https://www.docnow.io/
Open Source Software (OSS) or Proprietary? OSS
Pricing: Free


Lexos

Lexos is a web-based tool designed for transforming, analyzing, and visualizing texts. Lexos is designed for use primarily with small to medium-sized text collections, and especially for use with ancient languages and languages that do not employ the Latin alphabet. Lexos was created as an entry-level platform for Humanities scholars and students new to computational techniques while providing tools and techniques sophisticated enough for advanced research.

Details

Website: http://lexos.wheatoncollege.edu/upload
Open Source Software (OSS) or Proprietary? OSS
Pricing: Free


OpenRefine

OpenRefine helps you work with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data. OpenRefine is available in English, Chinese, Spanish, French, Russian, Portuguese (Brazil), German, Japanese, Italian, Hungarian, Hebrew, Filipino, Cebuano, and Tagalog.

Details

Website: http://openrefine.org/
Open Source Software (OSS) or Proprietary? OSS
Pricing: Free


Pandoc

Pandoc is a “universal document converter” that enables the easy conversion of markup formats. Pandoc supports Markdown, AsciiDoc, Emacs Org-Mode, Textile, HTML5, EPUB, TEI Simple, Microsoft Word docx, OpenDocument XML, LaTeX Beamer, PDF, and more.

Details

Website: https://pandoc.org/index.html
Open Source Software (OSS) or Proprietary? OSS
Pricing: Free


PhoTransEdit

PhoTransEdit is a set of free applications that reduces the time it takes to make phonemic transcriptions of English text. It provides automatic phonemic transcription, customization and editing of the output, export of transcriptions to several formats, and many other functions.

Details

Website: http://www.photransedit.com/
Open Source Software (OSS) or Proprietary? OSS
Pricing: Free


Scrapy

Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

Details

Website: https://scrapy.org/
Open Source Software (OSS) or Proprietary? OSS
Pricing: Free


Social Feed Manager

Social Feed Manager is open source software that harvests social media data and web resources from Twitter, Tumblr, Flickr, and Sina Weibo. It empowers researchers, faculty, students, and archivists to collect, manage, and export social media data.

Details

Website: https://gwu-libraries.github.io/sfm-ui/
Open Source Software (OSS) or Proprietary? OSS
Pricing: Free


SourceCaster

SourceCaster helps you use the command line to work through common challenges that come up when working with digital primary sources, including converting PDF files to plain text (.txt) and managing file names.
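As an illustration of the file-name management task mentioned above, here is a minimal Python sketch. It is not taken from SourceCaster itself; the function names `normalized_name` and `rename_all` and the naming scheme (lowercase, underscores, ASCII only) are assumptions chosen for the example.

```python
import re
from pathlib import Path


def normalized_name(filename: str) -> str:
    """Return a lowercase, underscore-separated version of a file name.

    Hypothetical naming scheme: any run of characters other than ASCII
    letters, digits, or hyphens becomes a single underscore.
    """
    path = Path(filename)
    stem = re.sub(r"[^a-z0-9-]+", "_", path.stem.lower()).strip("_")
    return f"{stem}{path.suffix.lower()}"


def rename_all(directory: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """List (old, new) name pairs for files in a directory; rename if dry_run is False."""
    changes = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        new_name = normalized_name(path.name)
        if new_name == path.name:
            continue
        target = path.with_name(new_name)
        if target.exists():
            # Never overwrite an existing file; leave this one untouched.
            continue
        changes.append((path.name, new_name))
        if not dry_run:
            path.rename(target)
    return changes
```

Calling `rename_all("sources")` with the default `dry_run=True` only reports the planned changes, which makes it possible to review them before any file is renamed. On case-insensitive file systems, renames that differ only in letter case may be skipped by the overwrite check.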

Details

Website: http://thomasgpadilla.github.io/sourcecaster/
Open Source Software (OSS) or Proprietary? OSS
Pricing: Free


Stanford Named Entity Recognizer (NER)

Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors.

Details

Website: https://nlp.stanford.edu/software/CRF-NER.shtml
Open Source Software (OSS) or Proprietary? OSS
Pricing: Free


TextCleanr

TextCleanr is a quick, easy, web-based way to fix and clean up text when copying and pasting between applications. It can remove email indents, find and replace text, and clean up spacing, line breaks, word-processor characters, and more. It works well on tablets and mobile devices.
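For readers who prefer to script this kind of cleanup, the following Python sketch shows the same general operations (removing email quote markers, normalizing spacing and line breaks, replacing word-processor characters). It is an illustration only and does not reflect how TextCleanr is implemented; the function name `clean_pasted_text` and the specific replacement table are assumptions.

```python
import re


def clean_pasted_text(text: str) -> str:
    """Tidy text pasted from an email client or word processor."""
    # Map common word-processor characters to plain ASCII equivalents.
    replacements = {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en and em dashes
        "\u00a0": " ",                  # non-breaking space
    }
    for old, new in replacements.items():
        text = text.replace(old, new)

    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    cleaned = []
    for line in text.split("\n"):
        # Remove leading email quote markers such as "> " or ">> ".
        line = re.sub(r"^\s*(>\s*)+", "", line)
        # Collapse runs of spaces and tabs, then trim the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        cleaned.append(line)
    text = "\n".join(cleaned)

    # Reduce three or more consecutive newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The replacement table is deliberately small; extend it with whatever characters appear in your own sources.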

Details

Website: http://www.textcleanr.com/
Open Source Software (OSS) or Proprietary? Proprietary
Pricing: Free


Trifacta

Trifacta accelerates data cleaning and preparation with a modern platform for cloud data lakes and warehouses.

Details

Website: https://www.trifacta.com/
Open Source Software (OSS) or Proprietary? Proprietary
Pricing: Free use of Trifacta Wrangler; pricing tiers for additional services