📚 Resources
Past Exams
While this class specifically hasn’t been offered yet, it is inspired by a few different courses that have been offered many times, many of which have banks of old exams available online. The most relevant problems will be posted at our brand-new 🧠 Study Site, which you’ll use in discussion section.
If you’d like some additional practice, you can refer to:
- practice.dsc80.com – most similar to our course.
- practice.dsc40a.com – more theoretical than our course, but some problems will be relevant.
- practice.dsc10.com – more introductory-level than our course, but some DataFrame-related problems will be relevant.
Readings
Textbooks
- Principles and Techniques of Data Science, the textbook for Berkeley’s Data 100 course.
- This textbook is also supplemented by a set of Course Notes.
- DSC 80 Course Notes. These notes were originally written for UCSD’s version of this course, but have not been updated in a few years.
- Python for Data Analysis, an online textbook by Wes McKinney, the original developer of pandas.
- DSC 10 Course Notes. These notes were written for UCSD’s more introductory data science course, which introduces Python and Jupyter Notebooks. You’ll find a lot of the material here useful, too.
- Stanford’s Python Reference.
- EECS 201: Computer Science Pragmatics. This class covers “the essentials of using a computer effectively for EECS students,” and covers Unix-like systems, shells, version control, build systems, debugging, and scripting.
Articles
- Facts and myths about Python names and values – good to read if you’re confused about how variables and mutability work in Python.
- Views and Copies in pandas – a great read if you’d like to learn more about the infamous SettingWithCopyWarning.
- A Visual Introduction to Machine Learning and Model Tuning and the Bias-Variance Tradeoff, excellent visual descriptions of the last few weeks of the course (some terminology is different, but the ideas are the same).
- MLU Explain, a collection of interactive articles (prepared by Jared Wilber) that explain core machine learning ideas, like cross-validation, random forests, and precision and recall.
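As a quick taste of what the names-and-values article above covers, here is a minimal sketch in plain Python. The variable names are made up for illustration and the snippet is not taken from the article itself; it only shows that assignment binds a second name to the same object, while an explicit copy creates an independent one.

```python
# Two names can refer to the same list object.
a = [1, 2, 3]
b = a            # no copy is made; b is another name for the same list
b.append(4)      # mutates the one shared list
print(a)         # [1, 2, 3, 4]

# A shallow copy is a new list, so changes to it do not affect a.
c = a.copy()
c.append(5)
print(a)         # [1, 2, 3, 4]
print(c)         # [1, 2, 3, 4, 5]

# Rebinding a name points it at a different object; a is untouched.
b = [0]
print(a)         # [1, 2, 3, 4]
```

The same distinction between sharing and copying is what sits behind the view-versus-copy behavior discussed in the pandas article.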
Other Links
- pythontutor.com, a tool to visualize the execution of Python programs.
- pandastutor.com, the equivalent of pythontutor.com for DataFrame manipulation.
Regular Expressions
- regex101.com, a helpful site to have open while writing regular expressions.
- Python re library documentation and how-to.
- regex “cheat sheet” (taken from here).
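For a sense of how the re library listed above is typically used, here is a small sketch. The sample text and the date pattern are hypothetical, chosen only for illustration; a tool like regex101.com is a good place to experiment with a pattern before putting it in code.

```python
import re

# Hypothetical sample text, used only for this illustration.
text = "Lecture 3 is on 2024-01-17 and Lecture 4 is on 2024-01-19."

# A raw string keeps backslashes intact for the regex engine.
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

# With capture groups, findall returns one tuple of groups per match.
dates = re.findall(date_pattern, text)
print(dates)  # [('2024', '01', '17'), ('2024', '01', '19')]

# sub can rearrange the captured groups using backreferences.
us_style = re.sub(date_pattern, r"\2/\3/\1", text)
print(us_style)
```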
Finding Datasets
Generic sources of data
These sites allow you to search for datasets (in CSV format) from a variety of different domains. Some may require you to sign up for an account; these are generally reputable sources.
- Kaggle Datasets.
- Google’s dataset search.
- FiveThirtyEight.
- DataHub.io.
- Data.world.
- CORGIS.
- R datasets.
- Wikipedia. (Use this site to extract and download tables as CSVs.)
- Awesome Public Datasets GitHub repo.
- Links to even more sources.
Domain-specific sources of data
- Sports: Basketball Reference, Baseball Reference, etc.
- US Government Sources: census.gov, data.gov, data.ca.gov, data.sfgov.org, FBI’s Crime Data Explorer, Centers for Disease Control and Prevention.
- Global Development: data.worldbank.org, databank.worldbank.org, WHO.
- Transportation: New York Taxi trips, Bureau of Transportation Statistics, SFO Air Traffic Statistics.
- Music: Spotify Charts.
- COVID: Johns Hopkins.
- Any Google Forms survey you’ve administered! (Go to the results spreadsheet, then go to “File > Download > Comma-separated values”.)
Tip: if a site only offers a dataset as an Excel file rather than a CSV file, download it, open it in a spreadsheet application (Excel, Numbers, Google Sheets), and export it as a CSV.