Portfolio Homework đ
- Checkpoint due Monday, November 25th at 11:59PM (no slip days allowed!)
- Final Submission due Saturday, December 7th at 11:59PM (no slip days allowed!)
last updated November 5th at 9:49PM
Welcome to Portfolio Homework, the final homework assignment of the semester! đ
This homework is a mini-homework of sorts, and aims to be a culmination of everything youâve learned this semester. In the homework, you will conduct an open-ended investigation into one of the three datasets (Recipes and Ratings đ˝, League of Legends â¨ď¸, or Power Outages đ), or â upon permission of the instructor â choose your own. Specifically, youâll draw several visualizations to help understand the distributions of key variables, build and improve a predictive model, and share your findings with everyone on the internet.
The Portfolio Homework is worth 100 points total; the breakdown is described in the Rubric section at the bottom of this page. The Portfolio Homework is different from other homeworks in a few crucial ways:
- You can work with one partner (but donât have to). If you choose to work with a partner, read the Partner Guidelines section at the bottom.
- There is a checkpoint, due right before Thanksgiving, worth 15 points out of the 100 points the homework is scored out of. It exists to make sure youâve picked a dataset and started some preliminary work. See the Checkpoint Submission section for more details and for the Gradescope submission link.
- You cannot use slip days on either the checkpoint or the final submission, since theyâre due so close to the end of the semester that we need all the time we can get to grade them. All components of the homework are manually graded.
- You cannot drop the Portfolio Homework â it will be a part of your overall homework grade no matter what. Your lowest score among Homeworks 1-11 will be the one that is dropped.
- As your final deliverables, youâll submit two things to us:
- a public-facing website. Weâll eventually create a public âshowcaseâ site that has links to everyoneâs submissions â that is, your website will be available to the entire internet!
- a PDF of your Jupyter Notebook.
Before we get into the details of whatâs expected of you, note that this homework in particular is meant to encourage you to build something youâre proud of. This is a chance to put something concrete on your resume to show potential employers. Grading will likely be lenient, but your work will be publicly available forever!
As alluded to above, the homework is broken into two parts:
- Part 1: An analysis, submitted as a Jupyter Notebook. This will contain the details of your work. Focus on completing your analysis before moving to Part 2, as the analysis is the bulk of the homework.
- Part 2: A report, submitted as a website. This will contain a narrative âstoryâ with visuals. Focus on this after finishing most of your analysis.
The homework is worth a total of 100 points. You can see the distribution of points in the Rubric section at the very bottom.
Table of Contents
Choosing a Dataset
In this homework, you will perform an open-ended investigation into a single dataset.
Default Options
We expect that the majority of students will choose one of the following three datasets.
Recipes and Ratings đ˝   League of Legends â¨ď¸   Power Outages đ
The dataset description pages linked above each have three sections:
- Getting the Data: Describes how to access the data and, in some cases, what various features mean. (In general, youâre going to have to understand what your data means on your own!) Youâre welcome to download additional data to help with your analyses, in addition to using the data thatâs provided for you.
- Example Questions and Prediction Problems: Use these as inspiration, but feel free to come up with your own questions and prediction problems!
- Special Considerations: Things to be aware of when working with the given dataset, e.g. some additional requirements.
When selecting which dataset you are going to use for your homework, try choosing the one whose topic appeals to you the most as that will make finishing the homework a lot more enjoyable.
To help contextualize the kinds of analysis you can do in this homework, it might help to look at these examples from a related course taught at UC San Diego. These examples offer insights into crafting effective research questions, but bear in mind that they have their own strengths and weaknesses. Treat them as a foundation for inspiration, but donât just repeat or copy their work â be original! Also note that the UC San Diego version of this assignment had slightly more requirements, so youâll see sections involving (for example) hypothesis tests that you arenât expected to have.
- League of Legends First Blood Statistical Analysis: This homework excelled in clarifying their research aims, making the study understandable to a broader audience. In your own homework, ensure that you provide a lucid and detailed explanation of your research focus.
- Analyzing Power Outages: This homework presents a noval way to do the data visualization. In your homework, please think about what is the best way to present your data.
Before choosing a dataset, read the rest of this page to see whatâs required of you!
Choosing Your Own Dataset
You may not be interested in any of the above datasets, or may already be doing research/other work involving a dataset that is of particular interest to you. If thatâs the case, you may be permitted to use a different dataset. To request approval, send Suraj (rampure@umich.edu) an email with the following:
- A one paragraph description of why youâve chosen this dataset and your plans for analysis. Comment on what draws you to this dataset over the default options, what preliminary questions you plan to answer, and what features you plan to use in a predictive model. Part of our filtering is making sure that your plans are sufficiently scoped for the homework â that is, your proposal is neither too easy nor too difficult â so the more details, the better.
- Proof that you already have access to the data you want to work with. Ideally, attach a link to the data source or the file so that we can see it ourselves and verify that your plans are possible.
The deadline to request approval to use a different dataset is the same as the deadline of the checkpoint, November 25th. Once you email Suraj with the above, you should hear back within 48 hours with your approval or denial and reasoning. Note that submitting the checkpoint is not the same as requesting approval; the only way to request approval to use a different dataset is to email Suraj with answers to the questions above. After November 25th, if you havenât requested approval, you must choose one of the default three options.
Part 1: Analysis
Before beginning your analysis, youâll need to set up a few things.
- Pull the latest version of the course GitHub repo, github.com/practicaldsc/fa24. Within the
homeworks/portfolio
folder, there is atemplate.ipynb
notebook that you will use as a template for the homework. If you delete the file or want another copy of the template, you can re-download it from here. This is where your analysis will live; you will submit this entire notebook to us. - Download the dataset you chose and load it into your template notebook.
Once you have your dataset loaded in your notebook, itâs time for you to find meaning in the real-world data youâve collected! Follow the steps below.
For each step, we specify what must be done in your notebook and what must go on your website, which we expand on in Part 2. We recommend you write everything in your notebook first, and then move things over to your website once youâve completed your analysis.
In Steps 1-2, youâll develop a deeper understanding of your dataset while trying to answer a single question.
Step 1: Introduction
Step | Analysis in Notebook | Report on Website |
---|---|---|
Introduction and Question Identification | Understand the data you have access to. Brainstorm a few questions that interest you about the dataset. Pick one question you plan to investigate further. (As the data science lifecycle tells us, this question may change as you work on your homework.) | Provide an introduction to your dataset, and clearly state the one question your homework is centered around. Why should readers of your website care about the dataset and your question specifically? Report the number of rows in the dataset, the names of the columns that are relevant to your question, and descriptions of those relevant columns. |
Step 2: Data Cleaning and Exploratory Data Analysis
Step | Analysis in Notebook | Report on Website |
---|---|---|
Data Cleaning | Clean the data appropriately. For instance, you may need to replace data that should be missing with NaN or create new columns out of given ones (e.g. compute distances, apply numerical-to-numerical transformations, or get time information from time stamps). | Describe, in detail, the data cleaning steps you took and how they affected your analyses. The steps should be explained in reference to the data generating process. Show the head of your cleaned DataFrame (see Part 2: Report for instructions). |
Univariate Analysis | Look at the distributions of relevant columns separately by using DataFrame operations and drawing at least two relevant plots. | Embed at least one plotly plot you created in your notebook that displays the distribution of a single column (see Part 2: Report for instructions). Include a 1-2 sentence explanation about your plot, making sure to describe and interpret any trends present, and how they answer your initial question. (Your notebook will likely have more visualizations than your website, and thatâs fine. Feel free to embed more than one univariate visualization in your website if youâd like, but make sure that each embedded plot is accompanied by a description.) |
Bivariate Analysis | Look at the statistics of pairs of columns to identify possible associations. For instance, you may create scatter plots and plot conditional distributions, or box plots. You must plot at least two such plots in your notebook. Be creative! | Embed at least one plotly plot that displays the relationship between two columns. Include a 1-2 sentence explanation about your plot, making sure to describe and interpret any trends present and how they answer your initial question. (Your notebook will likely have more visualizations than your website, and thatâs fine. Feel free to embed more than one bivariate visualization in your website if youâd like, but make sure that each embedded plot is accompanied by a description.) |
Interesting Aggregates | Choose columns to group and pivot by and examine aggregate statistics. | Embed at least one grouped table or pivot table in your website and explain its significance. |
Imputation | If needed for further analyses, impute any missing values. | If you imputed any missing values, visualize the distributions of the imputed columns before and after imputation. Describe which imputation technique you chose to use and why. If you didnât fill in any missing values, discuss why not. |
In Steps 3-5, you will build a predictive model, based on the knowledge of your dataset you developed in Steps 1-2.
Step 3: Framing a Prediction Problem
Step | Analysis in Notebook | Report on Website |
---|---|---|
Problem Identification | Identify a prediction problem. Feel free to use one of the example prediction problems stated in the âExample Questions and Prediction Problemsâ section of your datasetâs description page or pose one of your own. The prediction problem you come up with doesnât have to be related to the question you were answering in Steps 1-2, but ideally, your entire homework has some sort of coherent theme. | Clearly state your prediction problem and type (classification or regression). If you are building a classifier, make sure to state whether you are performing binary classification or multiclass classification. Report the response variable (i.e. the variable you are predicting) and why you chose it, the metric you are using to evaluate your model and why you chose it over other suitable metrics (e.g. accuracy vs. F1-score). Note: Make sure to justify what information you would know at the âtime of predictionâ and to only train your model using those features. For instance, if we wanted to predict your Final Exam grade, we couldnât use your Portfolio Homework grade, because we (probably) wonât have the Portfolio Homework graded before the Final Exam! Feel free to ask questions if youâre not sure. |
Step 4: Baseline Model
Step | Analysis in Notebook | Report on Website |
---|---|---|
Baseline Model | Train a âbaseline modelâ for your prediction task that uses at least two features. (For this requirement, two features means selecting at least two unique columns from your original dataset.) You can leave numerical features as-is, but youâll need to take care of categorical columns using an appropriate encoding. Implement all steps (feature transforms and model training) in a single sklearn Pipeline. Note: Both now and in Step 5: Final Model, make sure to evaluate your modelâs ability to generalize to unseen data! There is no ârequiredâ performance metric that your baseline model needs to achieve. | Describe your model and state the features in your model, including how many are quantitative, ordinal, and nominal, and how you performed any necessary encodings. Report the performance of your model and whether or not you believe your current model is âgoodâ and why. Tip: Make sure to hit all of the points above: many Portfolio Homeworks in the past have lost points for not doing so. |
Step 5: Final Model
Step | Analysis in Notebook | Report on Website |
---|---|---|
Final Model | Create a âfinalâ model that improves upon the âbaselineâ model you created in Step 4. Do so by engineering at least two new features from the data, on top of any categorical encodings you performed in Baseline Model Step. (For instance, you may use a StandardScaler on a quantitative column and a QuantileTransformer transformer on a different column to get two new features.) Again, implement all steps in a single sklearn Pipeline. While deciding what features to use, you must perform a search for the best hyperparameters (e.g. tree depth) to use amongst a list(s) of options, either by using GridSearchCV or through some manual iterative method. In your notebook, state which hyperparameters you plan to tune and why before actually tuning them.Optional: You are encouraged to try many different modeling algorithms for your final model (i.e. LinearRegression , RandomForestClassifier , Lasso , SVC , etc.) If you do this, make sure to clearly indicate in your notebook which model is your actual final model as that will be used to grade the above requirements.Note 1: When training your model, make sure you use the same training and testing sets from your baseline model. This way, the evaluation metric you get on your final model can be compared to your baselineâs on the basis of the model itself and not the dataset it was trained on. Based on which method you use for hyperparameter tuning, this may mean that you will need to use some of your training data as your validation data. If this is the case, make sure to train your final model on the whole training set to evaluation. Note 2: You will not be graded on âhow muchâ your model improved from Step 4: Baseline Model to Step 5: Final Model. What you will be graded on is on whether or not your model improved, as well as your thoughtfulness and effort in creating features, along with the other points above. Note 3: Donât try to improve your modelâs performance just by blindly transforming existing features into new ones. Think critically about what each transformation youâre doing actually does. For example, thereâs no use in using a StandardScaler transformer if your goal is to reduce the MSE of a linear model: as we learned in Lecture 18, standardizing features in a regression model does not change the modelâs predictions, only its coefficients! | State the features you added and why they are good for the data and prediction task. Note that you canât simply state âthese features improved my accuracyâ, since youâd need to choose these features and fit a model before noticing that â instead, talk about why you believe these features improved your modelâs performance from the perspective of the data generating process. Describe the modeling algorithm you chose, the hyperparameters that ended up performing the best, and the method you used to select hyperparameters and your overall model. Describe how your Final Modelâs performance is an improvement over your Baseline Modelâs performance. Optional: Include a visualization that describes your modelâs performance, e.g. a confusion matrix, if applicable. |
Style
While your website will be neatly organized and tailored for public consumption, it is important to keep your analysis notebook organized as well. Follow these guidelines:
- Your work for each of the five homework steps described above (Introduction, Data Cleaning and Exploratory Data Analysis, âŚ, Final Model) should be completed in code cells underneath the Markdown header of that sectionâs name.
- You should only include work that is relevant to posing, explaining, and answering the question(s) you state in your website. You should include data quality, cleaning, and missingness assessments, and intermediate models and features you tried, though these should broadly be relevant to the tasks at hand.
- Make sure to clearly explain what each component of your notebook means. Specifically:
- All plots should have titles, labels, and a legend (if applicable), even if they donât make it into your website. Plots should be self-contained â readers should be able to understand what they describe without having to read anything else.
- All code cells should contain a comment describing how the code works (unless the code is self-explanatory â use your best judgement).
Part 2: Report
The purpose of your website is to provide the general public â your classmates, friends, family, recruiters, and random internet strangers â with an overview of your homework and its findings, without forcing them to understand every last detail. We donât expect the website creation process to take very much time, but it will certainly be rewarding. Once youâve completed your analysis and know what you will put in your website, start reading this section.
Your website must clearly contain the following five headings, corresponding to the five steps mentioned in Part 1:
- Introduction
- Data Cleaning and Exploratory Data Analysis
- Framing a Prediction Problem
- Baseline Model
- Final Model
Donât include âStep 1â, âStep 2â, etc. in your headings â the headings should be exactly as they are above.
The specific content your website needs to contain is described in the âReport on Websiteâ columns above. Make sure to also give your website a creative title that relates to the question youâre trying to answer, and clearly state your name(s) and email(s).
Your report will be in the form of a static website, hosted for free on GitHub Pages. More specifically, youâll use Jekyll, a framework built into GitHub Pages that allows you to create professional-looking websites just by writing Markdown (practicaldsc.org is built using Jekyll!). GitHub Pages does the âhardâ part of converting your Markdown to HTML.
If youâd like to follow the official GitHub Pages & Jekyll tutorial, youâre welcome to, though we will provide you with a perhaps simpler set of instructions here. A very basic site with an embedded visualization can be found at rampure.org/dsc80-proj3-test1/; the source code for the site is here. Note that this example site doesnât have the same headings that youâre required to have.
Step 1: Initializing a Jekyll GitHub Pages Site
- Create a GitHub account, if you donât already have one.
- Create a new GitHub repository for your homework. Give it a descriptive name, like
league-of-legends-analysis
, noteecs398-portfolio-hw
.- GitHub Pages sites live at
<username>.github.io/<reponame>
(for instance, the site for github.com/practicaldsc.org/fa24 is practicaldsc.github.io/fa24, which we redirect to from practicaldsc.org). - As such, donât include âEECS 398â or âPortfolio Homeworkâ in your repoâs name â this looks unprofessional to future employers, and gives you a generic-sounding URL. Instead, mention that this is a homework for EECS 398 at U-M in the repository description.
- Make sure to make your repository public.
- Select âADD a README file.â This ensures that your repository starts off non-empty, which is necessary to continue.
- GitHub Pages sites live at
- Click âSettingsâ in the repository toolbar (next to âInsightsâ), then click âPagesâ in the left menu.
- Under âBranchâ, click the âNoneâ dropdown, change the branch to âmainâ, and then click âSave.â You should soon see âGitHub Pages source saved.â in a blue banner at the top of the page. This initiates GitHub Pages in your repository.
- After 30 seconds, reload the page again. You should see âYour site is live at http://username.github.io/reponame/â. Click that link â you now have a site!
- Click âCodeâ in the repo toolbar to return to the source code for your repo.
Note that the source code for your site lives in README.md
. As you push changes to README.md
, they will update on your site automatically within a few minutes! Before moving forward, make sure that you can make basic edits:
- Clone your repo locally.
- Make some edits to
README.md
. - Push those changes back to GitHub, using the following steps:
- Add your changes to âstagingâ using
git add README.md
(repeat this for any other files you add). - Commit your changes using
git commit -m '<some message here>'
, e.g.git commit -m 'changed title of website'
. - Push your changes using
git push
.
- Add your changes to âstagingâ using
Moving forward, we recommend making edits to your website source code locally, rather than directly on GitHub. This is in part due to the fact that youâll be creating folders and adding files to your source code.
Step 2: Choosing a Theme
The default âthemeâ of a Jekyll site is not all that interesting.
To change the theme of your site:
- Create a file in your repository called
_config.yml
. - Go here, and click the links of various themes until you find one you like.
- Open the linked repository for the theme youâd like to use and scroll down to the âUsageâ section of the README. Copy the necessary information from the README to your
_config.yml
and push it to your site.
For instance, if I wanted to use the Merlot theme, Iâd put the following in my _config.yml
:
remote_theme: pages-themes/merlot@v0.2.0
plugins:
- jekyll-remote-theme # add this line to the plugins list if you already have one
Note that youâre free to use any Jekyll theme, not just the ones that appear here. You are required to choose some theme other than the default, though. See more details about themes here.
Step 3: Embedding Content
Now comes the interesting part â actually including content in your site. The Markdown cheat sheet contains tips on how to format text and other page components in Markdown (and if youâd benefit by seeing an example, you could always look at the Markdown source of this very page â meta!).
What will be a bit trickier is embedding plotly
plots in your site so that they are interactive. Note that you are required to do this, you cannot simply take screenshots of plots from your notebooks and embed them in your site. Hereâs how to embed a plotly
plot directly in your site.
- First, youâll need to convert your plot to HTML. If
fig
is aplotly
Figure
object (for instance, the result of callingpx.express
,go.Figure
, or.plot
whenpd.options.plotting.backend = "plotly"
has been run), then the methodfig.write_html
saves the plot as HTML to a file. Call it usingfig.write_html('file-name.html', include_plotlyjs='cdn')
.- Change
'file-name.html'
to the path where youâd like to initially save your plot. include_plotlyjs='cdn'
tellswrite_html
to load the source code for theplotly
library from a server, rather than including the entire source code in your HTML file. This drastically reduces the size of the resulting HTML file, keeping your repo size down.
- Change
- Move the
.html
file(s) youâve created into a folder in your website repo calledassets
(or something similar).- Depending on where your template notebook is saved, you could combine this step with the step above by calling
fig.write_html
with the correct path (e.g. `fig.write_html(â../league-match-analysis/assets/matches-per-year.htmlâ)).
- Depending on where your template notebook is saved, you could combine this step with the step above by calling
- In
README.md
, embed your plot using the following syntax:
<iframe
src="assets/file-name.html"
width="800"
height="600"
frameborder="0"
></iframe>
iframe
stands for âinline frameâ; it allows us to embed HTML files within other HTML files.- You can change the
width
andheight
arguments, but donât change theframeBorder
argument.
Refer here for a working example (and here for its source code).
Try your best to make your plots look professional and unique to your group â add axis labels, change the font and colors, add annotations, etc. Remember, potential employers will see this â you donât want your plots to look like everyone elseâs! If youâd like to match the styles of the plotly
plots used in lecture (e.g. with the white backgrounds), you can import and use the lec_utils.py
file thatâs in the homeworks/portfolio
folder of our public repo, alongside template.ipynb
.
To convert a DataFrame in your notebook to Markdown source code (which you need to do for both the Data Cleaning and Interesting Aggregates sections of Step 2: Data Cleaning and Exploratory Data Analysis in Part 1), use the .to_markdown()
method on the DataFrame. For instance,
print(counts[['semester', 'Count']].head().to_markdown(index=False))
displays a string, containing the Markdown representation of the first 5 rows of counts
, including only the 'semester'
and 'Count'
columns (and not including the index). You can see how this appears here; remember, no screenshots (and also remember that the âAssessment of Missingnessâ title is not something you need to have, thatâs just an example website). You may need to play with this a little bit so that you donât show DataFrames that are super, super wide and look ugly.
Local Setup
The above instructions give you all you need to create and make updates to your site. However, you may want to set up Jekyll locally, so that you can look at how changes to your site would look without having to push and wait for GitHub to re-build your site. To do so, follow the steps here and then here.
Submission and Rubric
Overall, the homework is worth 100 points. We describe the breakdown in the Rubric section below.
Checkpoint Submission
As mentioned at the top of this page, this homework has a required checkpoint, worth 15 of the 100 points. You can submit the checkpoint here on Gradescope. The checkpoint asks you to answer the following questions:
- (3 points) Which of the three datasets did you choose and why? Or did you get pre-approval to choose your own dataset, and why?
- (6 points) Upload a screenshot of a
plotly
visualization youâve created while completing Part 1, Step 2: Data Cleaning and Exploratory Data Analysis. - (6 points) What is the column you plan on trying to predict in Part 1, Steps 3-5? Is it a classification or regression problem?
Final Submission
You will ultimately submit your homework in two ways:
- By uploading a PDF version of your notebook to the specific âPortfolio Homework Notebook PDF (Dataset)â assignment on Gradescope for your dataset.
- To export your notebook as a PDF, first, restart your kernel and run all cells. Then, go to âFile > Print Previewâ. Then, save a print preview of the webpage as a PDF. There are other ways to save a notebook as a PDF but they may require that you have additional packages installed on your computer, so this is likely the most straightforward.
- Itâs fine if your
plotly
graphs donât render in the PDF output of your notebook. However, make sure none of the code is cut off in your notebookâs PDF. You will lose 5% of the points available on this homework if your code is cut off. - This notebook asks you to include a link to your website; make sure to do so.
- By submitting a link to your website to the âPortfolio Homework Website Link (All Datasets)â assignment on Gradescope.
- We will use the links provided on Gradescope to create a âshowcase siteâ with links to everyoneâs websites so that the rest of the class can see your work!
- Here, youâll also need to provide your group member name(s), email(s), and the title of your website.
To both submissions, make sure to tag your partner. You donât need to submit your actual .ipynb
file anywhere. While your website must be public and you should share it with others, you should not make your code for this homework available publicly.
Remember that you canât use slip days on any part of this homework â not on the checkpoint, not on the final PDF submission, and not on the final website link submission. There are a lot of moving parts to this assignment â donât wait until the last minute to try and submit!
Rubric
Your homework will be graded out of 100 points. The rough rubric is shown below. If you satisfy these requirements as described above, you will receive full credit.
Note that the rubric is intentionally vague when it comes to Steps 3-5. This is because an exact rubric would specify exactly what you need to do to build a model, but figuring out what to do is a large part of Steps 3-5.
Component | Weight |
---|---|
Checkpoint | 15 points |
Step 1: Introduction | 5 points |
Step 2: Data Cleaning and Exploratory Data Analysis ⢠Cleaned data (5 points) ⢠Performed univariate analyses (5 points) ⢠Performed bivariate analyses and aggregations (5 points) ⢠Commented on imputation strategies (5 points) | 20 points |
Step 3: Framing a Prediction Problem | 5 points |
Step 4: Baseline Model | 20 points |
Step 5: Final Model | 25 points |
Overall: Included all necessary components on the website | 10 points |
Total | 100 points |
Partner Guidelines
Working with a partner? Keep the following points in mind:
- You are both required to actively contribute to all parts of the project. You must both be working on the assignment at the same time together, either physically or virtually on a Zoom call. You are encouraged to follow the pair programming model, in which you work on just a single computer and alternate who writes the code and who thinks about the problems at a high level. In particular, you cannot split up the project and each work on separate parts independently. So, you cannot say Partner 1 is responsible for the analysis and Partner 2 is responsible for the website.
- You should decide on your partnership as soon as possible, but certainly before the checkpoint deadline of November 25th (which youâre required to submit on your own).
- Ultimately, you will submit three deliverables for this homework to three separate Gradescope assignments â the checkpoint, a PDF of your notebook, and a link to your website. Make sure to have just one partner submit these deliverables, and have them tag the other partner! That is, donât make duplicate submissions of the same work.