Summer Reading List
USF MSDS Summer Reading List 2019
This document contains the reading list for the 2019-2020 cohort. Note that there is a strong correlation between students who struggle and those that fail to take the summer reading list seriously. We spend very little time reviewing the material during the first part of the program; failure to come prepared will limit your own ability to learn.
This document is organized into three sections: (1) Linear Algebra, (2) Probability and Statistics and (3) Computer Language and Tools. It is recommended that you spend time working through your weakest area -- not your strongest. Use your time wisely.
Linear Algebra
Below is a superset of good linear algebra textbooks for review. In the linear algebra boot camp, the instructor will draw from a combination of these books, with emphasis on the first (Strang) and the fifth (Anton):
- Introduction to Linear Algebra by Gilbert Strang
- Matrix Analysis and Applied Linear Algebra by Carl D. Meyer
- Applied Linear Algebra and Matrix Analysis by Thomas S. Shores
- Numerical Linear Algebra by Lloyd N. Trefethen and David Bau
- Elementary Linear Algebra by Anton, 11th edition, Wiley
For the time being, you can rely on your linear algebra book from college, as well as the free linear algebra book by Jim Hefferon at:
In your initial review, focus on the following topics: vectors, matrices, and their associated operations; solving linear equations; determinants; vector spaces; eigenvalues and eigenvectors; and linear transformations. Time permitting, work through problems that involve implementing these topics computationally.
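As one small example of the computational side of these topics, the sketch below solves a linear system by Gaussian elimination with partial pivoting and computes a determinant from the same elimination steps. It uses plain Python so that every step is visible. The function names `solve_linear_system` and `determinant` are made up for this illustration and do not come from any of the books above; in practice you would likely reach for a numerical library instead.

```python
def solve_linear_system(a, b):
    """Solve a x = b for a square matrix a (list of rows) and a vector b."""
    n = len(a)
    # Build the augmented matrix [a | b] as floats so the inputs are not modified.
    m = [[float(v) for v in row] + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def determinant(a):
    """Determinant as the product of pivots, with a sign flip per row swap."""
    n = len(a)
    m = [[float(v) for v in row] for row in a]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # no usable pivot: the matrix is singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


if __name__ == "__main__":
    a = [[2, 1], [1, 3]]
    b = [3, 5]
    print(solve_linear_system(a, b))  # approximately [0.8, 1.4]
    print(determinant(a))             # approximately 5.0
```

A useful exercise is to check the results by hand for a 2x2 or 3x3 system, then compare them against a library routine.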
If you are looking for an online learning venue, consider taking the OCW Scholar course in linear algebra at the Massachusetts Institute of Technology at:
For further reading, Ian Goodfellow’s chapter on linear algebra is terse but focused on preparation for deep learning:
If you want a recent video connecting linear algebra to data science problems in Python, enjoy:
Probability and Statistics
In the probability and statistics boot camp (MSAN 504), the instructor will use a combination of the following books:
- Ghahramani, Saeed. Fundamentals of Probability, with Stochastic Processes, 3rd Edition.
- Freund, John. Mathematical Statistics, 8th Edition.
- Ross, Sheldon. Simulation, 5th Edition.
- Moore, David. The Basic Practice of Statistics, 7th Edition.
- Hogg, Robert V., and Elliot A. Tanis. Probability and Statistical Inference, 9th Edition.
Selected excerpts from the above-mentioned books will be available on the Canvas page for MSAN 504. Students will have access to this Canvas page as soon as they log onto MyUSF (and the Canvas page is published).
As you review, focus on the following learning outcomes:
- Understanding the definitions of probability mass functions, probability density functions, cumulative distribution functions, and moments;
- Knowing the properties of the most famous examples of random variables (Bernoulli, binomial, geometric, exponential, Poisson, normal, etc.);
- Mastering the underpinnings of the most common parameter estimation technique, maximum likelihood estimation;
- Understanding the difference between a sample and a population;
- Being able to state the Central Limit Theorem, understanding its importance, and applying it in a variety of basic situations;
- Being able to implement, by hand and in R, all elementary one- and two-sample tests of hypotheses and confidence interval constructions (e.g., means, proportions, correlation, ratios of variances, etc.);
- Understanding the fundamental axioms, rules, and laws of probability theory;
- Simulating (using R) random numbers governed by various probability distributions using the method of inverse transformation and the acceptance-rejection technique;
- Defining, and working with examples related to, conditional probability;
- Understanding the importance of the concept of independence;
- Proving and using the Law of Total Probability;
- Using the Law of Total Probability to prove Bayes' Theorem and deploying Bayes' Theorem in a variety of practical situations;
- Working with random vectors as well as random variables;
- Working with multivariate distributions, as well as the concepts of conditional expectation and independence in a high-dimensional setting; and
- Working with the multivariate Gaussian distribution.
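The outcomes above call for R. Purely as an illustration of the two simulation techniques named in the list (inverse transformation and acceptance-rejection), here is a sketch in Python that uses only the standard library. The target distributions are assumptions chosen for this example, not course material: an exponential distribution with rate `lam`, and the density f(x) = 6x(1 - x) on [0, 1] (a Beta(2, 2) distribution) with a uniform proposal.

```python
import math
import random


def sample_exponential(lam, rng):
    """Inverse transformation: solve F(x) = u for F(x) = 1 - exp(-lam * x)."""
    u = rng.random()  # u ~ Uniform[0, 1)
    return -math.log(1.0 - u) / lam


def sample_beta22(rng):
    """Acceptance-rejection for the density f(x) = 6 x (1 - x) on [0, 1].

    The proposal g is Uniform(0, 1), and f(x) <= c * g(x) with c = 1.5,
    the maximum of f (attained at x = 0.5).
    """
    c = 1.5
    while True:
        x = rng.random()  # candidate drawn from the proposal
        u = rng.random()  # uniform used for the accept/reject test
        if u <= 6.0 * x * (1.0 - x) / c:
            return x


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    n = 100_000
    exp_draws = [sample_exponential(2.0, rng) for _ in range(n)]
    beta_draws = [sample_beta22(rng) for _ in range(n)]
    # Both sample means should be close to the theoretical mean of 0.5.
    print(sum(exp_draws) / n, sum(beta_draws) / n)
```

On average the rejection sampler accepts one candidate in every c = 1.5 tries, which is why a tight bound c matters for efficiency.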
If you are looking for an online venue to review this material, there are several you might consider. We recommend the first two courses in Duke University's “Statistics with R Specialization” at Coursera. We also recommend Berkeley's three-part introduction to statistics at edX, which addresses probability, descriptive statistics, and inferential statistics.
Computer Programming Languages and Tools
You’ll be using your laptop every day, either in class or at Practicum. Your laptop must be in perfect working order and have the necessary computing power and storage. If you have a broken or slow laptop, it will directly affect your success in the program.
We strongly recommend you buy a Mac laptop with at least the following specs:
Processor (one of the following):
- 2.9 GHz dual-core Intel Core i5, Turbo Boost up to 3.3 GHz
- 2.6 GHz quad-core Intel Core i7, Turbo Boost up to 3.5 GHz
Memory and storage:
- 16 GB of RAM (not 8 GB)
- 128 GB PCIe-based flash storage (minimum)
Let us be very clear that macOS is the preferred operating system for the program. You can get away with Linux (but you might have issues). Completely avoid Windows. Most of the software we use in this program cannot be easily installed and does not work well on Windows. If you choose to use Linux or Windows, you are on your own: Faculty will not be able to help you install software or debug OS-specific issues!
When students are struggling on the programming assignments, our first question to them is: “What did you do in the months prior to the bootcamp?” The answer is typically not studying programming enough. Every year, a few students do not pass the computational boot camp and must exit the program. We have created the following guide so you can properly prepare yourself.
There is a saying about young children: “First learn to read, then read to learn.” You are similar: “First learn to code, then code to learn.” You’ll learn most concepts in this program through coding. The easier it is for you to code and use programming tools, the easier you will find the entire curriculum.
How to learn
Many of you have already taken programming courses, either in person or online, but you might not have gotten much out of them. If you want to learn how to write code, there is no substitute for actually typing code to solve problems. Do not just listen and watch the instructor write a program; you need to write the code yourself, just as you don’t learn to play an instrument by listening to music. When learning a musical instrument, you first cover popular tunes (exactly copying what the instructor is typing). Then you start writing your own riffs and songs (solving problems and doing projects). Just as playing and creating songs is the best part of music, solving problems is the best part of coding.
Do not copy and paste code while learning. Manually typing code is part of the learning process. You might make small typos, which will help hone your debugging skills and overall code-writing abilities.
Also, write code in an interactive environment (see the Jupyter Notebook section below) so you get immediate feedback about what works and what does not. Later we’ll write scripts in .py files and run them at the command line. For now, writing scripts will slow down your learning.
One of the best ways of studying is to sit in front of a blank screen with a coding prompt for a problem you have already solved. Solve the problem from memory without looking at your previous solution or the internet. Only use those resources when you are completely stuck. Drilling skills from memory will reinforce what you have already learned.
One of our favorite tools is Python Tutor, http://pythontutor.com/. It visualizes what happens when you run Python code. It is useful to understand existing code or debug broken code. Being able to visualize code execution is a critical skill for all programmers.
Before the first day of bootcamp, we expect you to:
- Given a real-world problem, write working code.
- Write code that is readable by others.
- Be able to use common tools.
- Write Python code using:
  - Built-in types: int, float, str, list, and dict. Be able to use the correct type for the current context and to convert between types.
  - Conditional statements: if, elif, and else, including nesting.
  - Iteration with for and while loops.
  - Indexing str and list: selecting single elements and ranges.
  - Containers (list and dict): be able to add and remove elements and to traverse all items of each type.
  - Common built-in functions (print, range, round, len, min, max) and the str method split.
  - Reading and writing text files.
  - Packages: installing packages with pip or conda, then importing and using them.
- We suggest everyone complete Python for Everybody (PY4E), https://www.py4e.com/lessons. As stated above, you learn by doing, so create a login and complete the exercises, which are autograded.
- Here are additional online courses:
- We have also constructed a hands-on activity in the form of a friendly coding contest (sorry, no prizes): https://www.hackerrank.com/msds-incoming-student
- At the start of the first class of the Computational Bootcamp, there will be a quiz to verify that you have studied these materials and completed the coding contest.
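As a rough self-check on the Python skills listed above, the sketch below combines several of them in one small task: writing and reading a text file, splitting strings, counting with a dict, and slicing a sorted list. The file name `sample.txt` and the function `count_words` are made up for this illustration; it is not one of the contest or quiz problems.

```python
import os
import tempfile


def count_words(path):
    """Return a dict mapping each lowercase word in the file to its count."""
    counts = {}
    with open(path) as f:                  # reading a text file
        for line in f:                     # iteration with a for loop
            for word in line.split():      # str.split breaks on whitespace
                word = word.lower().strip(".,!?")
                if word == "":             # conditional: skip bare punctuation
                    continue
                counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    path = os.path.join(folder, "sample.txt")
    with open(path, "w") as f:             # writing a text file
        f.write("First learn to code, then code to learn.\n")

    counts = count_words(path)
    # Sort words by count (descending), then alphabetically; slice the top three.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    for word in ranked[:3]:
        print(word, counts[word])
```

Try typing this out rather than pasting it, then rewrite it from memory a day later.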
R Programming Language
Most courses will be in the Python programming language. There are specific subjects for which the R programming language excels, primarily time series and visualization.
We suggest you start learning R with DataCamp’s “Introduction to R” in the browser.
Next, set up R on your computer and continue learning there with the swirl package.
“Intro to R” videos on YouTube will walk you through the fundamentals.
Finally, you’ll be spending time in the tidyverse (a collection of R packages that work well together for Data Science). You can learn about the tidyverse with “R for Data Science”.
Additionally, “Cookbook for R” is a great resource for solving common data science problems.
Throughout the MSDS program, you will use the same tools as professional Data Scientists. That means by the time you start Practicum or a job, you’ll be ready to contribute to the team right away.
Before arriving at orientation, you are required to have the following software installed on your laptop and to be familiar with it:
Using the command line is a critical skill in this program (all the rest of the examples assume use of the command line). The command line is also called “the shell” or Bash (the name of the specific shell we use). Data Scientists use the command line every day to run scripts, manage files, or use computers in the cloud. Go through a course, such as http://www.bash.academy, to make sure you are familiar with the command line. You can use the built-in Terminal App or iTerm2, a macOS Terminal Replacement https://www.iterm2.com/
Install Anaconda 3, which includes Python 3 and most of the packages you will need in this program. You can use either the graphical installer or the command-line installer: https://www.anaconda.com/distribution/#download-section
After you install it, make sure it works. Open the Terminal App and type:
You should see something like this:
Python 3.6.8 |Anaconda custom (64-bit)| (default, Dec 29 2018, 19:04:46)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.3.0 -- An enhanced Interactive Python. Type '?' for help.
Notice that it is running Anaconda and Python 3.x, where 3.x means any version with a major release number of 3 (3.6, 3.7, 3.8, …); the number after the decimal point is the minor release number. If something else happens, Google is your friend for figuring out why it did not start Anaconda’s version of Python 3.x.
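A complementary check can be run from inside Python itself. The sketch below is only an illustration: the helper `describe_interpreter` is hypothetical, and looking for the word "Anaconda" in the version banner is an assumption based on the sample output above, so it may not hold for every Anaconda release.

```python
import sys


def describe_interpreter(version_info=None, banner=None):
    """Summarize whether this looks like Python 3.x from Anaconda."""
    version_info = sys.version_info if version_info is None else version_info
    banner = sys.version if banner is None else banner
    major, minor = version_info[0], version_info[1]
    return {
        "version": "{}.{}".format(major, minor),
        "is_python3": major == 3,
        # Assumption: Anaconda builds mention "Anaconda" in the banner text.
        "mentions_anaconda": "Anaconda" in banner,
    }


if __name__ == "__main__":
    print(describe_interpreter())
    print(sys.executable)  # path of the interpreter that is actually running
```

If `sys.executable` points somewhere other than your Anaconda installation, a different Python is earlier on your PATH.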
We mostly use Jupyter Notebooks; for more information, see http://jupyter.org/. Notebooks combine programs and text, which lets you interleave your code with your thoughts. Data Science programming is particularly challenging because of the variety of tools we use. Notebooks are extremely helpful because data, data frames, code, graphs, and output are all in the same document. Here is a video tutorial to check out: https://www.youtube.com/watch?v=HW29067qVWk