This is the course website for a previous iteration of the course. If you’re looking for the most recent course website, look at practicaldsc.org.
📚 Resources
Past Exams
While this class specifically hasn’t been offered yet, it is inspired by a few different courses that have been offered many times, many of which have banks of old exams available online. The most relevant problems will be posted at our brand-new 🧠 Study Site, which you’ll use in discussion section.
If you’d like some additional practice, you can refer to:
- practice.dsc80.com – most similar to our course.
- practice.dsc40a.com – more theoretical than our course, but some problems will be relevant.
- practice.dsc10.com – more introductory-level than our course, but some DataFrame-related problems will be relevant.
Textbooks
- Principles and Techniques of Data Science, the textbook for Berkeley’s Data 100 course.
- These are also supplemented by a set of Course Notes.
- DSC 80 Course Notes. These notes were originally written for UCSD’s version of this course, but have not been updated in a few years.
- Python for Data Analysis, an online textbook by Wes McKinney, the original developer of pandas.
- DSC 10 Course Notes. These notes were written for UCSD’s more introductory data science course, which introduces Python and Jupyter Notebooks. You’ll find a lot of the material here useful, too.
- Stanford’s Python Reference.
- EECS 201: Computer Science Pragmatics. This class covers “the essentials of using a computer effectively for EECS students,” and covers Unix-like systems, shells, version control, build systems, debugging, and scripting.
Topic-Specific Resources
There are lots of readings linked on the course website. Here, we’re collecting other helpful resources that will help explain ideas in the course. If you found something online that was super helpful, let us know and we’ll add it here!
Python
- pythontutor.com, a tool to visualize the execution of Python programs.
- Facts and myths about Python names and values – good to read if you’re confused about how variables and mutability work in Python.
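If the names-and-values idea feels abstract, here is a minimal sketch of the behavior that usually causes confusion. The variable names and the `add_item` helper are made up for illustration and don't come from the linked article.

```python
# Assignment never copies: `b` becomes a second name for the same list object.
a = [1, 2, 3]
b = a
b.append(4)          # mutates the one shared list, so `a` sees the change too

# An explicit copy creates a separate list object.
c = list(a)
c.append(5)          # only `c` changes

def add_item(items):
    """Hypothetical helper: appends in place, so the caller's list is mutated."""
    items.append("x")

shopping = ["milk"]
add_item(shopping)

print(a)             # [1, 2, 3, 4]
print(c)             # [1, 2, 3, 4, 5]
print(a is b, a is c)
print(shopping)      # ['milk', 'x']
```

Pasting this into pythontutor.com should show `a` and `b` pointing at the same object, which is often the quickest way to see what's going on.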
pandas
- pandastutor.com, the equivalent of pythontutor.com for DataFrame manipulation.
- Views and Copies in pandas – a great read if you’d like to learn more about the infamous SettingWithCopyWarning.
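As a rough illustration of the pattern behind that warning, the sketch below uses a small made-up DataFrame (the column names and values are invented for this example). It shows two habits that generally sidestep the problem: assigning through a single `.loc` call, and taking an explicit `.copy()` when you want an independent subset. Whether the warning actually appears depends on your pandas version, so treat this as a sketch rather than a full account.

```python
import pandas as pd

# Made-up data, purely for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [50, 80, 95]})

# Chained indexing such as
#     df[df["score"] > 60]["score"] = 100
# may assign into a temporary object rather than df itself, which is the
# situation the warning is about. A single .loc call is unambiguous:
df.loc[df["score"] > 60, "score"] = 100

# If you want a separate subset to modify, say so with .copy().
high = df[df["score"] == 100].copy()
high["score"] = 0    # changes `high` only; `df` is left alone

print(df)
print(high)
```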
Visualization
- UC Berkeley Data 100 Lecture 10 (by Suraj).
- UCSD DSC 106: Data Visualization.
- UW CSE 442: Data Visualization.
Missing Values
Web Scraping
- STATS 701 notes – these are in R, but are still helpful for giving you a general idea of what you can scrape and how.
Regular Expressions
- regex101.com, a helpful site to have open while writing regular expressions.
- Python re library documentation and how-to.
- regex “cheat sheet” (taken from here).
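To connect the resources above, here is a small sketch of the kind of pattern you might prototype on regex101.com and then run with Python’s `re` module. The sample text and both patterns are invented for illustration, and the email pattern is deliberately simplistic.

```python
import re

# Made-up sample text, purely for illustration.
text = "Contact: alice@example.com, bob@example.org; phone 555-1234."

# findall returns every non-overlapping match as a list of strings.
# This is a simplified email pattern, not a complete one.
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)

# search returns the first match (or None); parentheses create capture groups.
phone = re.search(r"(\d{3})-(\d{4})", text)

print(emails)            # ['alice@example.com', 'bob@example.org']
print(phone.groups())    # ('555', '1234')
```

Raw strings (the `r"..."` prefix) keep backslashes like `\d` and `\w` from being interpreted by Python before the regex engine sees them.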
Machine Learning
- A Visual Introduction to Machine Learning and Model Tuning and the Bias-Variance Tradeoff, excellent visual descriptions of the last few weeks of the course (some terminology is different, but the ideas are the same).
- MLU Explain, a collection of interactive articles (prepared by Jared Wilber) that explain core machine learning ideas, like:
- Linear Regression.
- The Bias-Variance Tradeoff.
- Train, Test, and Validation Sets.
- Cross-Validation.
- and other ideas we’ll see later in the semester!
Finding Datasets
Generic Sources of Data
These sites allow you to search for datasets (in CSV format) from a variety of different domains. Some may require you to sign up for an account; these are generally reputable sources.
- Kaggle Datasets.
- Google’s dataset search.
- FiveThirtyEight.
- DataHub.io.
- Data.world.
- CORGIS.
- R datasets.
- Wikipedia (use this site to extract and download tables as CSVs).
- Awesome Public Datasets GitHub repo.
- Awesome JSON Datasets GitHub repo.
- Data from Introduction to the Digital Humanities at MSU.
- Sage Research Methods Datasets.
- Links to even more sources.
Domain-Specific Sources of Data
- Sports: Basketball Reference, Baseball Reference, etc.
- US Government Sources: census.gov, data.gov, data.ca.gov, data.sfgov.org, FBI’s Crime Data Explorer, Centers for Disease Control and Prevention.
- Environment: National Centers for Environmental Information (e.g. Oceanography data from NOAA), Environmental Data Initiative.
- Social Sciences: Inter-university Consortium for Political and Social Research, General Social Survey, data.worldbank.org, databank.worldbank.org, WHO.
- Transportation: New York Taxi trips, Bureau of Transportation Statistics, SFO Air Traffic Statistics.
- Music: Spotify Charts.
- COVID: Johns Hopkins.
- Any Google Forms survey you’ve administered! (Go to the results spreadsheet, then go to “File > Download > Comma-separated values”.)
Tip: if a site only allows you to download a file as an Excel file, not a CSV file, you can download it, open it in a spreadsheet viewer (Excel, Numbers, Google Sheets), and export it to a CSV.
Here’s a tutorial on how to download JSON data from data.gov, for example.
University of Michigan Library Guides
The university library system maintains several guides on how to conduct research and where to find information. They contain lots of links to local data sources. Here are a few guides of interest:
- Guide on Finding Data.
- Guide on Community Data.
- Guide on Geospatial Data.
- Guide on Detroit Maps.
- Guide on Political Science Data.
- Watch this video for guidance on how to search for Political Science research work.
- Here’s a related tutorial on how to download raw datasets.
- Guide on News Data.
- General Engineering and Computer Science research guides.
If you have questions about how to use any of these guides, or how to use any of the other resources our library has to offer, contact Sarah Barbrow (sbarbrow@umich.edu), our Engineering librarian (who also recorded this video, of interest to students who are looking to get into social sciences research)!