Learn to Code by Solving Problems: A Python Programming Primer

Book Review: Dive Into Data Science

Hi everyone,

What do you do next after finishing Learn to Code by Solving Problems?

That’s a good question I receive often. There are many possible answers, including practicing more problem solving, reading a follow-up Python book, working on larger programming projects… or specializing in a certain area!

Speaking of specializing: Data Science (making sense of data by exploring, visualizing, and making predictions) is a hot career right now, and you’re well positioned to start on that career path after finishing my book. That’s because Python is a great language for doing Data Science.

Starting off in Data Science is challenging not because there are too few resources, but because there are too many! At last count, I am aware of 11,324,482,394 YouTube videos… OK, joking, but you get the point. Really, where do you start?

I’m happy to be able to personally recommend Dive Into Data Science by Dr. Bradford Tuckfield. As with any recommendation that I make or any book I endorse: I have read every page, cover to cover.

Why I Like This Book

If you’re going to use Python for Data Science, then there are Python packages that you simply have to know. Pandas is one of them, used to organize data. Matplotlib is another, used to visualize data. Seaborn is a third, again used to visualize data. This book jumps straight to using Pandas, Matplotlib, and Seaborn, all in Chapter 1! You’ll find scikit-learn, a hugely important machine learning library, in Chapter 2, along with NumPy for working with huge matrices of data. You’ll find SciPy in Chapter 3 for performing complex statistical tests. Yeah: Dr. Tuckfield doesn’t mess around. (You’ll have to install each of these packages before use, but then you’ll be good to go.)
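If you want to see which of these packages are already on your machine, here is a small sketch that is not taken from the book. It uses only the standard library to look up each import name. The list of names is my own reading of the packages mentioned above; note that scikit-learn is imported as sklearn. Anything reported as missing can typically be installed with pip, for example `pip install pandas matplotlib seaborn scikit-learn numpy scipy`.

```python
from importlib.util import find_spec

# Import names for the packages mentioned above. scikit-learn is imported
# as "sklearn", which differs from the name used to install it.
PACKAGES = ["pandas", "matplotlib", "seaborn", "sklearn", "numpy", "scipy"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


status = check_packages(PACKAGES)
for name, installed in status.items():
    print(f"{name:12s} {'ok' if installed else 'missing'}")
```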

People often make the mistake of hammering out some gee-whiz next-level stats on their data before they even know what their data looks like! It’s important to explore the data first, and I appreciate that Dr. Tuckfield starts with data exploration in Chapter 1. You’ll learn how to calculate summary statistics and correlations, look at particular slices of your data, and visualize the results of analyses.

Do you want to read about Widget Corp. and the number of widgets they produce per year? No? Don’t worry: there’s none of that here. The very first example you see in Chapter 1 uses real bike-sharing data, and I find the real data throughout the book to be both more motivating and more honest than tiny fake datasets. (When datasets are fabricated, I still find them interesting reflections of reality!)

Much of the data that you’ll find and use as a data scientist will be in something called a CSV file. This book shows you how to work with CSV files, including reading their data and cleaning up that data prior to analysis. You won’t always have the CSV file on your computer, so you’ll also learn how to access CSV files on the Internet. I would have liked to see some other file formats sprinkled in as well, such as Excel spreadsheets, but that’s a minor quibble: reading Excel files is something you can easily look up, and the end result will be a data frame just like it would be for CSV files. I’ve also found some Python code that could be simplified (e.g. the list comprehensions in Chapter 2). These are my only minor suggestions for improvement.
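Here is a rough sketch of the read-then-clean idea, not taken from the book. With Pandas you would typically call `read_csv`, which accepts a local path or a URL and gives you a data frame. This sketch swaps in the standard library's csv module, and the column names and values are invented, so treat it as an illustration of the steps rather than the book's approach.

```python
import csv
import io

# Hypothetical CSV content, standing in for a file on disk or on the web.
RAW_CSV = """date,station,rides
2024-01-01, Downtown ,120
2024-01-02,Harbour,
2024-01-03,Downtown,95
"""


def load_clean_rows(text):
    """Parse CSV text, tidy the values, and drop rows with a missing count."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rides = row["rides"].strip()
        if not rides:
            continue  # skip rows where the ride count is missing
        rows.append({
            "date": row["date"].strip(),
            "station": row["station"].strip(),  # remove stray whitespace
            "rides": int(rides),                # convert text to a number
        })
    return rows


clean_rows = load_clean_rows(RAW_CSV)
for row in clean_rows:
    print(row)
```

Dropping incomplete rows is only one possible cleaning choice. Depending on the analysis, you might instead fill in a default value or flag the row for a closer look.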

One thing data scientists do a lot of is forecasting or making predictions. For example, a lot of my own research tries to make predictions about student outcomes given available student data. For this kind of analysis, we use tools like linear regression (with one predictor or many), t-tests, logistic regression, k-nearest neighbors, decision trees, random forests, and neural networks. Other times we want to organize data into similar groups (such as similar customers or similar books), and to do that we use various forms of clustering. All of these tools are explored in the book. Training sets and test sets, overfitting, interpretable vs. non-interpretable models, supervised vs. unsupervised learning: it’s all here. I appreciate that Dr. Tuckfield also briefly goes into how these tools work behind the scenes for anyone who is interested. You’ll even implement some of the analyses yourself before relying on a Python package to do the magic.
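In the spirit of implementing an analysis yourself before handing it to a package, here is a minimal sketch of linear regression with one predictor, evaluated on a held-out test set. It is not the book's code. The data is synthetic (generated from y = 2x + 1 plus noise), and the function names are my own.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the predictions and the true values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: y = 2x + 1 plus random noise, seeded for repeatability.
rng = random.Random(42)
points = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(100)]

# Shuffle, then hold out 20% of the points as a test set.
rng.shuffle(points)
train, test = points[:80], points[80:]

slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
test_mse = mean_squared_error(
    [x for x, _ in test], [y for _, y in test], slope, intercept
)

print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("test MSE:", round(test_mse, 3))
```

The model never sees the test points while fitting, so the test error is an honest estimate of how it will do on new data. A model that does well on the training set but poorly on the test set is overfitting.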

We often don’t have access to every entity in the world that makes up our population of interest. For example, maybe we want to understand the amount of plastic used by all restaurants for purposes of reducing waste, but getting this information from every restaurant would be impossible. What do we do? It’s called sampling, and I like Dr. Tuckfield’s in-depth explanation of sampling in Chapter 3. Also: if you’ve always wanted to know what a p-value really means, read this chapter.
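Here is a small sketch of the sampling idea, not taken from the book. The population is simulated (in real life you would never have it, which is the whole point), and the numbers are invented. It draws a random sample, compares the sample mean to the population mean, and estimates the standard error from the sample alone.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: kg of plastic used per week by 10,000 restaurants.
# In practice this full list is exactly what we cannot get.
population = [rng.uniform(5, 50) for _ in range(10_000)]

# Suppose we can only survey 200 restaurants, chosen at random.
sample = rng.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# Standard error: roughly how far a sample mean from a sample of this size
# tends to land from the true population mean.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print("population mean:", round(population_mean, 2))
print("sample mean:", round(sample_mean, 2))
print("estimated standard error:", round(standard_error, 2))
```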

Chapter 4 is all about A/B testing. It’s how you decide which of multiple options leads to the best outcomes (the best conversion rate on shopping carts, or email engagement, or whatever). You’ll learn how to properly run an experiment here that doesn’t have confounds and actually tests what you want to test, and to carefully consider ethical concerns throughout the process.
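One simple way to judge an A/B test result is a permutation test: shuffle the outcomes between the two groups many times and see how often chance alone produces a gap as large as the one you observed. The sketch below is my own illustration under that assumption, not necessarily the method the book uses, and the visitor counts are invented.

```python
import random


def permutation_test(outcomes_a, outcomes_b, n_permutations=1000, seed=0):
    """Two-sided permutation test on the difference in conversion rates.

    Each outcome is 1 (converted) or 0 (did not convert).
    Returns (observed difference, estimated p-value).
    """
    rng = random.Random(seed)
    n_a = len(outcomes_a)
    n_b = len(outcomes_b)
    observed = sum(outcomes_b) / n_b - sum(outcomes_a) / n_a

    pooled = list(outcomes_a) + list(outcomes_b)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign outcomes to groups at random, as if the variant had no effect.
        rng.shuffle(pooled)
        rate_a = sum(pooled[:n_a]) / n_a
        rate_b = sum(pooled[n_a:]) / n_b
        if abs(rate_b - rate_a) >= abs(observed):
            extreme += 1

    # Add one to both counts so the estimate is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical results: 100 of 1,000 visitors convert on A, 150 of 1,000 on B.
variant_a = [1] * 100 + [0] * 900
variant_b = [1] * 150 + [0] * 850

diff, p_value = permutation_test(variant_a, variant_b)
print("difference in conversion rate:", round(diff, 3))
print("estimated p-value:", round(p_value, 4))
```

A small p-value says the gap is unlikely to be chance alone. It says nothing about whether the experiment was free of confounds, which is why the design advice in this chapter matters as much as the arithmetic.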

The book includes a brief introduction to web scraping in Chapter 8 because, let’s face it: your data is not always going to come as nicely packaged CSV files! Sometimes you’ll have to go get it (keeping in mind each website’s rules about whether you are allowed to do that, of course).
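At its core, scraping means pulling structured data out of a page's HTML. Here is a minimal sketch using only the standard library's html.parser, which is not necessarily what the book uses. The page is a hypothetical string rather than a real website, so no network access is involved, and the class name and table contents are invented.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:
                self.rows.append([])  # tolerate a cell with no enclosing row
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical page content; a real scraper would download this first.
PAGE = """
<table>
  <tr><th>City</th><th>Stations</th></tr>
  <tr><td>Springfield</td><td>12</td></tr>
  <tr><td>Shelbyville</td><td>7</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(PAGE)

# The header row has no <td> cells, so it shows up as an empty list: drop it.
scraped_rows = [row for row in parser.rows if row]
print(scraped_rows)
```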

There are two standalone chapters near the end of the book that offer introductions to recommendation systems and natural language processing (NLP). You know how Netflix and Amazon and every other website in the world try to sell you stuff that’s similar to what you like? That’s a recommendation system. And you know how teachers use software to detect plagiarism? That’s NLP. These are two advanced areas of data science that could fill entire books, but the introduction here is likely sufficient to help you determine whether you’re interested in pursuing these topics further.
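Both areas lean on the idea of measuring how similar two things are. One common measure is cosine similarity, sketched below on word counts as a toy plagiarism check. This is my own illustration, not the book's code, and real systems are far more sophisticated. The sentences are invented.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two word-count vectors: 0 means no shared words."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


original = "Data science is about making sense of data."
suspect = "Data science is all about making sense of the data."
unrelated = "My cat sleeps on the warm windowsill every afternoon."

sim_suspect = cosine_similarity(word_counts(original), word_counts(suspect))
sim_unrelated = cosine_similarity(word_counts(original), word_counts(unrelated))

print("original vs suspect:", round(sim_suspect, 3))
print("original vs unrelated:", round(sim_unrelated, 3))
```

Swap the word counts for each customer's ratings of products and the same function gives a rough notion of "customers like you", which is one starting point for a recommendation system.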

The book ends with a brief introduction to SQL and R, two other popular data science tools. I often don’t like these “oh BTW here’s a bunch of other stuff that you don’t know yet” chapters, but here I do like it because it shows you that, really, these tools are all getting at the same things with techniques that you’ve already learned in the book.
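To see what "getting at the same things" can look like, here is a small sketch, not from the book, that computes a per-group average twice: once with a SQL query (run through Python's built-in sqlite3 module) and once in plain Python. The table name, column names, and numbers are all invented.

```python
import sqlite3

# Hypothetical records: (station, number of rides).
records = [("Downtown", 120), ("Harbour", 80), ("Downtown", 95), ("Harbour", 60)]

# The SQL way: load the records into an in-memory database and GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (station TEXT, num_rides INTEGER)")
conn.executemany("INSERT INTO rides VALUES (?, ?)", records)
sql_result = conn.execute(
    "SELECT station, AVG(num_rides) FROM rides GROUP BY station ORDER BY station"
).fetchall()
conn.close()

# The plain Python way: group the values by station, then average each group.
groups = {}
for station, num_rides in records:
    groups.setdefault(station, []).append(num_rides)
py_result = sorted(
    (station, sum(values) / len(values)) for station, values in groups.items()
)

print(sql_result)
print(py_result)
```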

Summary

I highly recommend this book as your introduction to Data Science. You’ll use the most important Python data science packages, work with real data, and learn how to visualize and analyze data in ways that today’s data scientists do. The trick to writing an introductory Data Science book is to land on the right balance between “now watch what this magical line of Python code does” and “here’s what’s happening and what this approach is good for”. I think Dr. Tuckfield has done it.