Summer Reading List 2021
This document contains the reading list for the 2021-2022 cohort. Note that struggling in the program correlates strongly with failing to take the summer reading list seriously. We spend very little time reviewing the material during the first part of the program; failure to come prepared will limit your own ability to learn.
This document is organized into three sections: (1) Linear Algebra, (2) Probability and Statistics, and (3) Computer Programming Languages and Tools. We recommend that you spend time working through your weakest area -- not your strongest. Use your time wisely.
Linear Algebra
Below is a superset of good linear algebra textbooks for review. In the linear algebra boot camp, the instructor will draw from a combination of these books, with emphasis on the first and fifth in the list (Strang and Anton):
- Introduction to Linear Algebra by Gilbert Strang
- Matrix Analysis and Applied Linear Algebra by Carl D. Meyer
- Applied Linear Algebra and Matrix Analysis by Thomas S. Shores
- Numerical Linear Algebra by Lloyd N. Trefethen and David Bau
- Elementary Linear Algebra by Anton, 11th edition, Wiley
For the time being, you can rely on your linear algebra book from college, as well as the free linear algebra book by Jim Hefferon at:
In your initial review, focus on the following topics: vectors, matrices, and their associated operations; solving linear equations; determinants; vector spaces; eigenvalues and eigenvectors; and linear transformations. Time permitting, work through problems that involve implementing these topics computationally.
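As one example of what a computational exercise on these topics might look like, here is a minimal sketch that solves a small system of linear equations by Gaussian elimination with partial pivoting. It is an illustration only, not taken from any of the textbooks above; the function name `solve_linear_system` and the sample system are our own choices, and it uses plain Python lists so that every step is visible.

```python
# Illustrative sketch: solve A x = b by Gaussian elimination with partial pivoting.
# The name `solve_linear_system` and the sample system are hypothetical choices.

def solve_linear_system(A, b):
    """Return x such that A x = b, for a square, non-singular matrix A."""
    n = len(A)
    # Build the augmented matrix [A | b] as a copy so the inputs are not modified.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]

    for col in range(n):
        # Partial pivoting: choose the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        M[col], M[pivot] = M[pivot], M[col]

        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]

    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


A = [[2, 1, -1],
     [-3, -1, 2],
     [-2, 1, 2]]
b = [8, -11, -3]
x = solve_linear_system(A, b)
print([round(v, 6) for v in x])
```

The exact solution of this sample system is x = 2, y = 3, z = -1, which you can confirm by substituting back into each equation. A good follow-up exercise is to solve the same system by hand and compare each elimination step.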
If you are looking for an online learning venue, consider taking the OCW Scholar course in linear algebra at the Massachusetts Institute of Technology at:
For further reading, Ian Goodfellow’s chapter on linear algebra is terse but focused on preparation for deep learning:
If you want a recent video connecting linear algebra with Python data science problems, enjoy:
Probability and Statistics
In the probability and statistics boot camp (MSDS 504), the instructor will use a combination of the following books:
- Ghahramani, Saeed. Fundamentals of Probability, with Stochastic Processes, 3rd Edition.
- Freund, John. Mathematical Statistics, 8th Edition.
- Ross, Sheldon. Simulation, 5th Edition.
- Moore, David. The Basic Practice of Statistics, 7th Edition.
- Hogg, Robert V., and Elliot A. Tanis. Probability and Statistical Inference, 9th Edition.
Selected excerpts from the above-mentioned books will be available on the Canvas page for MSDS 504. Students will have access to this Canvas page once it is published by the instructor later in the summer.
As you review, focus on the following learning outcomes:
● Understanding the definitions of probability mass functions, probability density functions, cumulative distribution functions, and moments;
● Knowing the properties of the most famous examples of random variables (Bernoulli, binomial, geometric, exponential, Poisson, normal, etc.);
● Mastering the underpinnings of the most common parameter estimation technique, maximum likelihood estimation;
● Understanding the difference between a sample and a population;
● Being able to state the Central Limit Theorem, understanding its importance, and applying it in a variety of basic situations;
● Being able to implement, by hand and in Python, all elementary one- and two-sample tests of hypotheses and confidence interval constructions (e.g., means, proportions, correlation, ratios of variances, etc.);
● Understanding the fundamental axioms, rules, and laws of probability theory;
● Simulating (using Python) random numbers governed by various probability distributions using the method of inverse transformation and the acceptance-rejection technique;
● Defining, and working with examples related to, conditional probability;
● Understanding the importance of the concept of independence;
● Proving and using the Law of Total Probability;
● Using the Law of Total Probability to prove Bayes' Theorem and deploying Bayes' Theorem in a variety of practical situations;
● Working with random vectors as well as random variables;
● Working with multivariate distributions, as well as the concepts of conditional expectation and independence in a high-dimensional setting; and
● Working with the multivariate Gaussian distribution.
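To make the simulation outcome above more concrete, here is a minimal sketch of the two sampling techniques it names, using only Python's standard library. The target distributions (an exponential with rate 2, and the density f(x) = 2x on [0, 1]), the function names, and the seed are our own illustrative choices, not material from MSDS 504.

```python
import math
import random


def sample_exponential(rate, rng):
    """Inverse transformation: solve u = F(x) = 1 - exp(-rate * x) for x."""
    u = rng.random()                      # u is uniform on [0, 1)
    return -math.log(1.0 - u) / rate      # 1 - u lies in (0, 1], so log is defined


def sample_linear_density(rng):
    """Acceptance-rejection for the density f(x) = 2x on [0, 1].

    The proposal g is Uniform(0, 1) and the envelope constant is c = 2,
    so a candidate y is accepted with probability f(y) / (c * g(y)) = y.
    """
    while True:
        y = rng.random()                  # candidate drawn from the proposal
        u = rng.random()                  # independent uniform for the accept test
        if u <= y:
            return y


rng = random.Random(504)                  # fixed seed (arbitrary) for repeatable runs
n = 100_000
exp_mean = sum(sample_exponential(2.0, rng) for _ in range(n)) / n
lin_mean = sum(sample_linear_density(rng) for _ in range(n)) / n

print(f"exponential(rate=2) sample mean: {exp_mean:.3f} (theory: 1/2)")
print(f"f(x) = 2x sample mean:           {lin_mean:.3f} (theory: 2/3)")
```

The sample means should land close to the theoretical values, which is a quick sanity check on both samplers. With c = 2, the acceptance-rejection loop accepts about half of its candidates on average.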
If you are looking for an online venue to review this material, there are several you might consider. We recommend the first two courses in the University of Michigan’s “Statistics with Python Specialization” on Coursera. We also recommend Berkeley's three-part introduction to statistics -- addressing probability, descriptive statistics, and inferential statistics (on edX).
Computer Programming Languages and Tools
You’ll be using your laptop every day, either in class or at Practicum. Your laptop must be in perfect working order and have the necessary computing power and storage. If you have a broken or slow laptop, it will directly affect your success in the program.
We strongly recommend you buy a Mac laptop with the new M1 processor rather than an Intel processor. That means either the MacBook Air or the MacBook Pro 13”, which have very similar performance. Get 16 GB of RAM, not 8 GB; the minimum 256 GB of SSD storage is sufficient. The MacBook Air M1 with those specs is $1,199 without the educational discount (which is usually small).
If you would like a bigger screen, you can look at the MacBook Pro 16”, but you will pay a premium ($2,399). At that price, it has an Intel Core i7 processor that boosts up to 4.5 GHz and 16 GB of RAM. (December 23, 2020)
You are not required to buy a new computer. Some faculty use refurbished laptops. Older models are plenty powerful enough for work as a Data Scientist. Personal computer discounts for USF students, including discounts on Apple products, can be found here.
Let us be very clear that macOS is the preferred operating system for the program. You can get away with Linux (but you might have issues). Completely avoid Windows. Most of the software we use in this program cannot be easily installed and does not work well on Windows. If you choose to use Linux or Windows, you are on your own: Faculty will not be able to help you install software or debug OS-specific issues!
When students are struggling with the programming assignments, our first question to them is: “What did you do in the months prior to the boot camp?” The typical answer is that they did not study programming enough. Every year, a few students do not pass the computational boot camp and must exit the program. We have created the following guide so you can properly prepare yourself.
You’ll learn most concepts in this program through coding. The easier it is for you to code and use programming tools, the easier you will find the entire curriculum.
How to learn
Many of you have already taken programming courses, either in person or online, but you might not have gotten that much out of them. If you want to learn how to write code, there is no substitute for actually typing code to solve problems. Do not just listen and watch the instructor write a program. You need to write the code. Similarly, you don’t learn to play an instrument by listening to music. When learning to play a musical instrument, first you just cover popular tunes (exactly copying what the instructor is typing). Then you start writing your own riffs and songs (solving problems and doing projects). Just as playing songs and creating is the best part of playing music, solving problems is the best part of coding.
Do not copy and paste code while learning. Manually typing code is part of the learning process. You might make small typos, which will help hone your debugging skills and overall code-writing abilities.
Also, write code in an interactive environment (see the Jupyter Notebook section below) so you get immediate feedback about what works and what does not. Later, we’ll write scripts in .py files and run them at the command line. For now, writing scripts will only slow down your learning.
One of the best ways to study is to sit in front of a blank screen with a coding prompt for a problem you have already solved. Solve the problem from memory without looking at your previous solution or the internet. Only use those resources when you are completely stuck. Drilling skills from memory will reinforce what you have already learned.
One of our favorite tools is Python Tutor, http://pythontutor.com/. It visualizes what happens when you run Python code. It is useful to understand existing code or debug broken code. Being able to visualize code execution is a critical skill for all programmers.
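As an example of the kind of code worth stepping through in a visualizer, here is a short snippet of our own (it is not taken from the Python Tutor site) showing the difference between giving a list a second name and making a copy of it. Watching the names and list objects update step by step makes the behavior much easier to see than reading the code alone.

```python
# Two names, one list: `b = a` copies the reference, not the list itself.
a = [1, 2, 3]
b = a
b.append(4)          # changes the single list that both a and b refer to

# `c = a[:]` builds a new list with the same elements.
c = a[:]
c.append(5)          # changes only the copy

print(a)  # [1, 2, 3, 4]
print(b)  # [1, 2, 3, 4]
print(c)  # [1, 2, 3, 4, 5]
```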
Before the first day of boot camp, we expect you to be able to:
- Write working code, given a real-world problem.
- Write code that is readable by others.
- Use common tools.
- Write Python code using:
  - Built-in types: int, float, str, list, and dict. Use the correct type for the current context and convert between types.
  - Conditional statements: if, elif, and else, including nesting.
  - Iteration with for and while loops.
  - Indexing of str and list: selecting single elements and ranges.
  - Containers (list and dict): add and remove elements, and traverse all items of each type.
  - Common built-in functions and methods: print, range, round, len, min, max, and str.split.
  - Reading and writing text files.
  - Packages: installing packages with pip or conda, then importing and using them.
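As a rough self-check, the following sketch touches most of the items in the list above within one small program: built-in types, conditionals, loops, indexing and slicing, a dict, a few built-in functions, and reading and writing a text file. The file name, sample text, and function name `count_words` are hypothetical choices of ours. If you can type this out and explain every line, you are in reasonable shape; if not, that tells you where to practice.

```python
import os
import tempfile


def count_words(path):
    """Return a dict mapping each lowercase word in the file to its count."""
    counts = {}                                   # dict: word -> count
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:                       # traverse the file line by line
            for word in line.split():             # split each line on whitespace
                word = word.lower().strip(".,;:!?")
                if word:                          # skip anything left empty
                    counts[word] = counts.get(word, 0) + 1
    return counts


# Write a small text file, then read it back and count its words.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "practice.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("The cat sat.\nThe cat ran!\n")
    counts = count_words(path)

words = sorted(counts)                            # list of str, alphabetical
for word in words:
    if counts[word] > 1:
        label = "repeated"
    else:
        label = "once"
    print(word, counts[word], label)

# Indexing and slicing: first element, last element, and a range.
print(words[0], words[-1], words[1:3])

# Type conversion: int and float arithmetic, then str for printing.
total = sum(counts.values())
share = round(len(words) / total, 2)
print("distinct words / total words = " + str(share))
```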
We suggest everyone complete Python for Everybody (PY4E) at https://www.py4e.com/lessons. As stated above, you have to learn by doing, so create a login and complete the exercises, which are autograded.
Here are additional online courses:
- We have also constructed a hands-on activity in the form of a friendly coding contest (sorry, no prizes): https://www.hackerrank.com/msds-incoming-student
Throughout the MSDS program, you will use the same tools as professional Data Scientists. That means by the time you start Practicum or a job, you’ll be ready to contribute to the team right away.
Please check out:
Before arriving at orientation, you should have Anaconda 3 installed on your laptop and have some familiarity with the command line (Terminal.app, iTerm2, etc.). Using the command line is a critical skill in this program (all the rest of the examples assume use of the command line). The command line is also called “the shell.” Data Scientists use the command line every day to run scripts, manage files, or use computers in the cloud. Go through a course, such as https://guide.bash.academy/, to make sure you are familiar with the command line.
We mostly use Jupyter Notebooks; for more information, see http://jupyter.org/. Notebooks combine programs and text, which lets you interleave your code with your thoughts. Data Science programming is particularly challenging because of the variety of tools we use. Notebooks are extremely helpful because data, data frames, code, graphs, and output are all in the same document. Here is a video tutorial to check out: https://www.youtube.com/watch?v=HW29067qVWk