9. Data Science and IPython

Author: Peter Parente

9.1. Goals

  • Stick a definition on “data science”
  • Learn about Python libraries for data science
  • Understand the usefulness of IPython Notebook
  • Practice manipulating data using Pandas
  • Practice plotting using matplotlib
  • Practice machine learning with scikit-learn

9.2. Introduction

Data Science is the burgeoning study and practice of extracting knowledge from data. It combines ideas and techniques from many fields including machine learning, statistics, mathematics, data warehousing, parallel and distributed computing, visualization, and many others. As the amount of data available to businesses and researchers continues to grow, so does the need for creative teams and powerful tools to draw insight from it.

Python has a large ecosystem of tools relevant to data science. In this session we will look at combining a small but very powerful set of them to explore (Pandas), model (scikit-learn), and visualize (matplotlib) information in a web environment for reproducible research (IPython Notebook).

To get started, watch the IPython Notebook slidecast (~77 minutes) showing the use of these tools in a basic exploration of the wine dataset from the UCI Machine Learning Repository. The slidecast includes demos of the following:

You can view and/or download a version of the wine analysis notebook built throughout the slidecast via the IPython Notebook Viewer.

If time permits, review these additional pages:

9.3. Exercises

You will need to complete the Setting Up instructions before you proceed with these exercises. Once you are set up, SSH into tottbox using the vagrant ssh command from the setup instructions. Then tackle the problems below. In this particular session, you can post your IPython Notebooks as gists, view them in NBViewer, and share their NBViewer URLs in the TotT community later.

9.3.1. Start a notebook

Create a shared tottbox folder to store your notebooks and start the IPython Notebook server running like so:

mkdir -p /vagrant/notebooks
cd !$
ipython notebook --ip=

Open your web browser and visit the tottbox IP and port that the server prints. Then create a new notebook in the web UI.

9.3.2. Load the Tar Heel Reader book data

Gary runs a web site called Tar Heel Reader (THR). It hosts a collection of community-contributed, easy-to-read, accessible books. If you haven’t seen it, visit the site now and read a book. In the following exercises, we will analyze a November, 2013 snapshot of the books hosted on the site (approximately 30,000).

THR books reside in a SQL database. To ease our exploration, I’ve converted the books SQL database table into a Pandas DataFrame and serialized it to disk. Download a zipped copy of the DataFrame to your laptop. Unzip it and move the *.dataframe object into a shared tottbox folder (e.g., putting it in the notebooks folder you created is fine for now).

In the notebook you created, import the pandas package and load the DataFrame with the code below.

import pandas

9.3.3. Take some basic measurements

With the DataFrame loaded, use the many methods and properties of the DataFrame to explore the data. Try to answer the following questions. (Hint: Use the Pandas documentation. Hit the Tab key repeatedly after . or ( in the notebook for autocompletion and function help.)

  • What are the columns in the DataFrame?
  • What does each row represent?
  • How many total rows are there?
  • How many total books are there?
  • How many books have been reviewed? Haven’t?
  • Books are written in how many different languages?
  • What is the mean number of pages per book? Median? Minimum? Max? Variance?
  • How many different authors have written books?
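The kinds of calls that answer questions like these can be sketched on a toy DataFrame. The column names below (book_id, page_number, author) are hypothetical stand-ins rather than the actual THR schema, so inspect the real columns first and adapt accordingly.

```python
import pandas

# Toy stand-in for the THR data. All column names here are assumptions.
df = pandas.DataFrame({
    "book_id": [1, 1, 2, 2, 2, 3],
    "page_number": [1, 2, 1, 2, 3, 1],
    "author": ["ann", "ann", "bob", "bob", "bob", "ann"],
    "language": ["en", "en", "es", "es", "es", "en"],
})

print(df.columns.tolist())          # names of the columns
print(len(df))                      # total number of rows
print(df["book_id"].nunique())      # number of distinct values in a column
print(df["language"].value_counts())  # rows per distinct value

# Several summary statistics of one numeric column at once.
stats = df["page_number"].agg(["mean", "median", "min", "max", "var"])
print(stats)
```

The same methods (nunique, value_counts, agg) apply to any column, which is why checking what each row represents comes first: a count of rows is not necessarily a count of books.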

9.3.4. Prep words per page (wpp) data

Say we want to understand how the length of the pages in the Tar Heel Reader books has or has not changed over time. To do so, we first have to chunk the page text into words based on some definition. Choose a definition and write it down in your notebook in a Markdown cell. Then use the apply method on the text column (a Series) of the DataFrame to put that definition into practice. Pass it a function that splits each page of text into a list of words according to your definition. Save the return value in a variable called words.

After producing the words Series, create another series called wpp. Use the apply method again, but this time compute the number of words per page instead of the words themselves.
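A minimal sketch of the two apply calls, run on a toy Series instead of the real text column. Splitting on whitespace is only one possible definition of a word and is an assumption here; split_words is a hypothetical helper name.

```python
import pandas

# Toy page text. In the exercise this would be the DataFrame's text column.
text = pandas.Series(["The cat sat.", "I see a big red dog!", ""])

def split_words(page):
    """Split a page of text on whitespace (one possible word definition)."""
    return page.split()

words = text.apply(split_words)  # Series whose values are lists of words
wpp = words.apply(len)           # Series of word counts, one per page

print(wpp.tolist())
```

Note that this definition keeps punctuation attached to words ("sat." counts as one word) and yields zero for an empty page. Whether that is acceptable depends on the definition you wrote down.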

9.3.5. Plot wpp over time

Return to the original DataFrame. Inspect some of its rows using the head and tail methods. Is it ordered in some way? Write your assumptions in a Markdown cell in your notebook.

Now plot the wpp Series you created in the prior step using the Series.plot method. The y-axis should represent the number of words on a page and the x-axis should represent a page in a book. The pages should be sorted in ascending chronological order as x increases.

Can you spot a trend in the plot? What if you play with the plotting parameters? Try a scatter plot instead? Take Markdown notes in your notebook.
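As a rough sketch of playing with the plotting parameters, the following draws the same toy wpp data as a connected line and as unconnected points. The data values are made up, and the Agg backend is selected only so the code runs without a display; in a notebook you would normally use inline plotting instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for running outside a notebook
import matplotlib.pyplot as plt
import pandas

# Toy words-per-page values, assumed to be in chronological order already.
wpp = pandas.Series([3, 5, 4, 8, 7, 12, 9, 15])

fig, (ax_line, ax_dots) = plt.subplots(1, 2, figsize=(10, 4))
wpp.plot(ax=ax_line, title="line")                     # default line plot
wpp.plot(ax=ax_dots, style=".", title="points only")   # markers, no connecting line

for ax in (ax_line, ax_dots):
    ax.set_xlabel("page (chronological order)")
    ax.set_ylabel("words per page")
```

With tens of thousands of pages, a connected line tends to turn into a solid band, so the points-only style (possibly with a small marker size or some transparency) is often easier to read.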

9.3.6. Plot the rolling, expanding wpp mean

Pandas has quite a few functions for computing moving statistics, that is, stats computed over a fixed-size window that slides along an ordered sample of data. Try using the moving mean function on the wpp Series and plot the results. Try a few more times with different parameter values. What does it do? What do you see? Write it in the notebook. (Hint: http://en.wikipedia.org/wiki/Moving_average)

Pandas also has support for expanding windows, that is, stats computed over an ordered sample of data up to and including each datum in the order. Try using the expanding mean on the wpp Series. Try a few more times with different parameter values. What do you see? Write it in the notebook.

Is there anything interesting to report from these plots?
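The difference between the two kinds of window is easiest to see on a tiny Series. This sketch assumes a recent pandas release, where the moving statistics are methods on the Series (rolling and expanding). The pandas versions current when this session was written exposed them as module-level functions such as pandas.rolling_mean and pandas.expanding_mean, which have since been removed.

```python
import pandas

wpp = pandas.Series([2, 4, 6, 8, 10])  # toy data

# Moving mean: average of the 3 most recent values at each position.
# The first two positions have fewer than 3 values, so they come out as NaN.
rolling = wpp.rolling(window=3).mean()

# Expanding mean: average of everything seen so far at each position.
expanding = wpp.expanding().mean()

print(rolling.tolist())
print(expanding.tolist())
```

A larger window smooths more aggressively but discards more leading values. Both results are Series, so they can be passed to the same plot calls as wpp itself.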

9.3.7. Consider pages per book (ppb) over time

Say we now want to understand how the pages per book (ppb) metric varies over time. Prepare a ppb Series and study it. Note any interesting findings in your notebook. (Hint: The DataFrame.groupby method will get you started with preparing the data.)
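One way the groupby hint might play out, sketched on toy data with one row per page. The book_id column name is hypothetical, and treating larger IDs as newer books is an assumption you would need to verify against the real data.

```python
import pandas

# Toy data: one row per page. book_id is a made-up column name.
df = pandas.DataFrame({
    "book_id": [10, 10, 10, 11, 12, 12],
    "text": ["a", "b", "c", "d", "e", "f"],
})

# Count the rows (pages) in each group of rows sharing a book_id.
# The result is a Series indexed by book_id, sorted by that key.
ppb = df.groupby("book_id").size()

print(ppb)
```

Because ppb is an ordinary Series, the plotting and moving-statistics techniques used for wpp carry over unchanged.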

9.3.8. Learn about clustering

THR authors can assign one or more fixed categories to their books. Nothing dictates that books must fit the available categories, and so it’s quite possible that additional categories or alternative organization schemes exist. One way to discover such patterns is to cluster books according to some measure of similarity and then simply study the books in a cluster.

The scikit-learn package has many clustering algorithms available. The basic one that we’ll use is called k-means clustering. Given an integer k number of clusters, k-means will attempt to partition our n books so that each book belongs to the cluster with the nearest mean-value for some property of our books. We need to choose a value for k and decide what property we’ll use to cluster them.
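To make the idea concrete before applying it to books, here is a small sketch of k-means on six 2-D points that form two obvious groups. The points, k=2, and the fixed random_state are illustrative choices only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of points: one near (0, 0), one near (5, 5).
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
])

# random_state fixes the random initialization so reruns give the same result.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(points)

print(km.labels_)           # cluster ID assigned to each point
print(km.cluster_centers_)  # mean of the points assigned to each cluster
```

The integer IDs themselves are arbitrary: which group is called 0 and which is called 1 depends on the initialization, so only the grouping is meaningful.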

Picking k is empirical. We’ll try a few values and see what results we get. Deciding what property we’ll use to cluster requires more thinking. If we want to discover common themes or topics across books, we might try clustering our books based on their titles. However, we have to remember that THR has books written in many languages. If we try running the clustering algorithm across all books at once, it’s not clear how books written in different languages will or will not relate. To simplify our task, we’ll focus on books written in English alone for the time being. (We can always try clustering on other languages independently or across languages later.)

9.3.9. Prep English titles

Use Pandas to get a Series of unique English book titles from the books DataFrame you loaded. This step amounts to a one-liner in which you:

  1. Select rows in the DataFrame that have language equal to “en”
  2. Select the title column from the remaining rows
  3. Drop duplicate titles
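The three steps above can be chained as in this sketch, shown on a toy DataFrame. The language and title column names follow the steps listed, but the data is made up.

```python
import pandas

# Toy stand-in for the books DataFrame.
df = pandas.DataFrame({
    "language": ["en", "en", "es", "en"],
    "title": ["My Dog", "My Dog", "Mi Gato", "The Farm"],
})

# Boolean mask selects the English rows, then the title column is taken,
# then repeated titles are dropped (the first occurrence is kept).
titles = df[df["language"] == "en"]["title"].drop_duplicates()

print(titles.tolist())
```

The resulting Series keeps the row labels of the original DataFrame, so its index may have gaps.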

Once you have the title Series, you need to transform the titles into feature vectors on which the k-means algorithm can operate. The sklearn.feature_extraction.text package has a number of classes that can do this with minimal effort. Add the following imports to your notebook:

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer

Now read the scikit-learn doc about these three classes and use each of them to transform your title Series into a new, independent series: count, hash, tfidf.

Start simply and use defaults where possible. Until you can visualize how the clustering is working, it makes little sense to start turning random knobs.
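As a starting point with all defaults, the three classes can be applied as in the sketch below. The toy titles are made up, and the variable is named hashed rather than hash only to avoid shadowing Python's built-in hash function. Note that each call returns a SciPy sparse matrix with one row per title, not a pandas Series.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)

titles = ["my dog", "my cat", "the big farm"]  # toy titles

# Raw term counts: one column per distinct word seen in the titles.
count = CountVectorizer().fit_transform(titles)

# Hashed term counts: a fixed number of columns, no vocabulary to learn,
# so transform can be called without fitting first.
hashed = HashingVectorizer().transform(titles)

# Term counts reweighted so that words common to many titles count less.
tfidf = TfidfVectorizer().fit_transform(titles)

print(count.shape, hashed.shape, tfidf.shape)
```

The count and tfidf matrices have one column per word in the learned vocabulary, while the hashed matrix has a fixed width set by the n_features parameter regardless of the input.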

9.3.10. Cluster English titles

We’ll now run the k-means clustering algorithm over each one of your transformed title Series. The immediate goal is to get a sense of how our choice of parameters affects the ability of k-means to decompose the entire set of books into clusters of books related by title.

Add the following import to your notebook:

from sklearn.cluster import KMeans

Construct an instance of the class called km. Configure it to create 20 clusters. Then fit the class to the first of your three title transformation Series, count. Once you’ve fit the model, create a new DataFrame that pairs the human-readable book titles with the assigned cluster IDs like so:

# where titles is your untransformed title Series
en_titles = pandas.DataFrame(titles)
en_titles['count_cluster'] = km.labels_

Re-fit the km algorithm to your hash and tfidf Series. Add each one to en_titles as a new column.

Now, for each of the three *_cluster columns you created, determine how many books fall into each of the 20 clusters. (Hint: groupby should help you here.)
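The fit-then-count workflow might look like the following sketch, shown end to end on toy titles with 2 clusters instead of 20. The data, the title column name, and the fixed random_state are assumptions for illustration.

```python
import pandas
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

titles = pandas.Series([
    "my dog", "my dog runs", "dog dog",
    "the farm", "farm animals", "big farm",
])

# Transform the titles, then fit k-means on the resulting sparse matrix.
count = CountVectorizer().fit_transform(titles)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(count)

# Pair each readable title with its assigned cluster ID.
en_titles = pandas.DataFrame({"title": titles})
en_titles["count_cluster"] = km.labels_

# Number of titles that landed in each cluster.
sizes = en_titles.groupby("count_cluster").size()
print(sizes)
```

A very lopsided size distribution (one huge cluster plus many tiny ones) is a common sign that the features or the choice of k deserve another look.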

Does the clustering algorithm appear to work better or worse for any of the transformations? What if you choose to create fewer or more clusters? What if you play with other options to the Vectorizer constructors or the KMeans constructor? Try turning some knobs and document what you discover in your notebook.

9.3.11. Visualize your clusters

The k-means algorithm assigns each book title to a cluster identified by an integer. That is all. Interpreting the cluster assignments in light of the book titles is the responsibility of the analyst (i.e., you).

Start this task by printing some of the titles in a cluster with the following code:

en_titles[en_titles.count_cluster == 0].head(25)

Vary the column name, cluster integer, method of sampling, and sample size. Do you see any patterns within your clusters? Can you assign a category name to any cluster (e.g., books about X)?

Studying clusters in this manner is inefficient at best and biased at worst. For instance, just because you look at the first 25 titles in a set of 900 books doesn’t mean those 25 are representative of the full set.

Find a way to better visualize and interpret your clusters. Consider manipulations of the titles and clusters using Pandas to show cluster contents compactly and without bias. Consider using matplotlib to display the information graphically in some way. Demonstrate your technique and document its pros and cons.
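One possible technique, offered only as a sketch and not as the expected answer: summarize each cluster by the terms that carry the most weight in its center, which looks at every title in the cluster instead of a hand-picked sample. The toy titles, k=2, and the choice of three terms per cluster are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "my dog", "dog and dog", "the big dog",
    "farm animals", "the farm", "big farm day",
]

vec = TfidfVectorizer()
X = vec.fit_transform(titles)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Invert the fitted vocabulary (term -> column) into an array (column -> term).
terms = np.empty(len(vec.vocabulary_), dtype=object)
for term, col in vec.vocabulary_.items():
    terms[col] = term

# For each cluster center, list the 3 terms with the largest weights.
top_terms = {}
for cluster_id, center in enumerate(km.cluster_centers_):
    heaviest = center.argsort()[::-1][:3]
    top_terms[cluster_id] = [terms[i] for i in heaviest]
    print(cluster_id, top_terms[cluster_id])
```

The upside is compactness: 20 clusters fit in 20 short lines. The downside is that frequent but uninformative words can dominate the lists, and a cluster with no coherent theme still gets top terms that may look meaningful.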

9.3.12. Interpret your results

Do your cluster experiments reveal any patterns in book titles? Do they suggest any complementary categorizations or tags for books on the THR site? Do they suggest common topics addressed by THR authors?

Are there clusters that are not easy to explain? Are there books that seem to befuddle clustering? Do you have any ideas about how to study and understand these books better?

9.4. Projects

If you want to try your hand at something larger than an exercise, consider one of the following.

9.4.1. Find books misclassified by language

Gary says that some number of books on the Tar Heel Reader site are marked as having the wrong language. Manually finding these misclassifications is a pain. A language classifier could help alleviate these problems. Using the data provided, we could:

  1. train the classifier on a set of books with known-to-be-correct language assignments (the ground truth),
  2. evaluate the accuracy of the classifier on a hold-out test set of books by comparing its language predictions with the ground-truth,
  3. apply a well-performing classifier to the entire set of books, and
  4. review those books where the classifier predicts a language that mismatches the language assigned by the human author.
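The four steps above can be sketched as follows. Everything specific here is an assumption for illustration: the toy sentences, the column names, the character n-gram features, and the naive Bayes classifier are just one plausible combination, and the sketch assumes a recent scikit-learn (train_test_split lives in sklearn.model_selection there).

```python
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy ground truth: text with known-correct language labels.
truth = pandas.DataFrame({
    "text": [
        "the dog runs in the park", "she reads a good book",
        "we like the big farm", "the cat sleeps on the bed",
        "el perro corre en el parque", "ella lee un buen libro",
        "nos gusta la granja grande", "el gato duerme en la cama",
    ],
    "language": ["en"] * 4 + ["es"] * 4,
})

# Hold out a quarter of the ground truth, keeping both languages in each split.
train, test = train_test_split(
    truth, test_size=0.25, random_state=0, stratify=truth["language"])

# Step 1: train on character 1- to 3-grams.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)), MultinomialNB())
clf.fit(train["text"], train["language"])

# Step 2: fraction of held-out books whose language is predicted correctly.
accuracy = clf.score(test["text"], test["language"])

# Steps 3 and 4: predict for all books, then list the disagreements.
# The second book is deliberately given a questionable label.
books = pandas.DataFrame({
    "text": ["the bird sings in the tree", "el pájaro canta en el árbol"],
    "language": ["en", "en"],
})
books["predicted"] = clf.predict(books["text"])
suspects = books[books["predicted"] != books["language"]]
print(accuracy)
print(suspects)
```

Wrapping the feature extraction and classifier in one pipeline makes it cheap to swap either piece when trying new experiments, which is the main point of the setup.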

The text document classification example in the scikit-learn documentation might help get you started. So might the sample pipeline for text feature extraction and evaluation in the scikit-learn doc. In fact, there are many ways to skin this cat using scikit-learn. The key is setting up your notebook to quickly try new experiments in defining features, in picking a classifier algorithm, in choosing classifier parameters, and in evaluating performance.

If you want to tackle this project in earnest, talk with Pete. He has some feature selection code that might help.

9.4.2. Build a recommendation engine

Gary has a second dataset derived from the Tar Heel Reader site that captures what books were read by what visitors to the site over time. This data can be used to train a recommendation engine based on collaborative filtering. Talk with Gary if you are interested in playing with this dataset and building a recommendation engine for the THR site.

9.4.3. Improve the IPython Notebook UI

jtyberg writes:

I love IPython notebook for ad-hoc analysis. However, there are a few shortcomings of the web UI that lessen my user experience. Among them is the tedious nature of reordering cells (moving them up or down) within a notebook. I would like to be able to select multiple cells and move them up/down the page all at once.

A possible solution would be to enable grouping of cells. Can we modify the underlying DOM structure by adding cell elements into the same parent? Then we can manipulate the parent element.

Another idea would be a gutter view within the notebook that shows a condensed view of the notebook content (think Sublime text editor). What if we could select individual cells or cell groups and move them up/down the page by dragging and dropping from within the gutter? That would be sweet.

This is more of a JavaScript project and is posted again in the jQuery session project list. The IPython Notebook has an unstable but working JavaScript API that might be useful in accomplishing either or both of these.

9.5. References

Choosing the right estimator
    A rough guide for choosing the right scikit-learn algorithm for your machine learning task
A gallery of interesting IPython Notebooks
    Gallery of IPython Notebooks
Matplotlib gallery
    Gallery of matplotlib examples
Scikit-learn examples
    Gallery of scikit-learn examples
Python Scientific Lecture Notes
    Tutorial material on the scientific Python ecosystem
Parallel Machine Learning with scikit-learn and IPython
    Tutorial on machine learning over “big data”