A Comparison of tidytext and tm

Part 1: Data Structures

Sean Warlick

6 minute read

For many people working with text and Natural Language Processing in R, the default package has been tm. This past fall, a new package, tidytext, entered the ring, offering new ways to work with text in R. Having done all of my text analysis using the tm package, I thought it was time to take a look at tidytext and compare the two libraries. We’ll start the comparison by looking at the underlying data structures of the two packages.

Sean Warlick

5 minute read

In my recent post Learning Python, I promised an article about using Python to gather data from web APIs. I was in the midst of building several functions to gather and clean real estate data from Trulia when I received an email announcing that they were shutting down the API. While that brought the project to an end, it inspired me to resume a long-abandoned project to analyze college swimming meet results.

Sean Warlick

4 minute read

Clean-up days in my apartment building netted me two computers that are perfect for running some basic experiments. The nicer of the two, a Dell Latitude C640, has a 2.40 GHz processor, 1 GB of RAM and a 60 GB hard drive. It’s basically an over-sized Raspberry Pi. Despite its relatively low power, the machine is perfect for learning how to set up an RStudio Server.

Sean Warlick

2 minute read

In the world of ‘Data Science’ there has been a simmering debate over the advantages of R and Python. Ultimately, this debate is futile. Each language has the tools needed to produce high-quality data analysis. Hoping to expand and complement my existing tool kit, I’ve spent the last couple of months learning Python. The bulk of my learning has been through Coursera’s Python for Everybody Specialization. Developed by Dr. Charles Severance at the University of Michigan, the material is presented in dynamic and detailed lectures.

Sean Warlick

4 minute read

Last week I had the pleasure of attending a talk given by Hadley Wickham to the Statistical Programming DC Meetup. It was great to have Hadley speak to the group about developing fluent interfaces for R. While the talk was aimed at using Pure, Predictable and Pipeable functions in software (package) development, these ideas can also be applied to data analysis to create more readable code. In both software development and data analysis, code that is easy to read matters for collaboration and reproducibility.