Syllabus
Vital info
Fall 2024:
- Instructor: Alex Reinhart
- TA: Julia Elrod
- Lecture: Posner 147, MW 12:30-1:50pm
- Office hours:
  - Prof. Reinhart: Tuesdays 3-4pm, Baker 232K
  - Julia Elrod: Mondays 10-11am, Porter 226D
Course description
This is a course in applied data analysis using regression. Students learn to answer real applied questions using methods including simple and multiple linear regression, causation, diagnostics, logistic regression, and generalized linear models; model selection (prediction risk, bias-variance tradeoff, risk estimation, model search, ridge regression, and lasso); and smoothing and nonparametric regression (linear smoothers, kernels, local regression, splines, and additive models). Students will practice real-world data analysis through several course projects culminating in written reports, and the course emphasizes the development of skills in interpreting and explaining statistical results.
Prerequisites
This course is primarily for first-year PhD students in Statistics & Data Science. It requires an appropriate background for entering that program, including linear algebra, multivariate calculus, and basic statistical theory. For example, an appropriate background would be to have received an A in a course on statistical inference like 36-700 or 36-705, and either to have extensive experience in statistical data analysis or to have received As in applied statistics classes focused on analyzing data, such as 36-401/607 and 36-402/608. Students should also be familiar with R on a level similar to 36-350 Statistical Computing, and be able to write a data analysis report (or have experience writing similar reports, such as lab reports).
Students not in the Statistics & Data Science PhD program can add themselves to the waitlist, and should contact the instructor to be added to the class. Describe your prior statistics experience and course work so we can ensure you are prepared for the course. Students interested in an applied regression course requiring less mathematical background should consider 36-401/607 (but contact the instructor to make sure it’s suitable for you!).
Learning objectives
By the end of this course, students will be able to write data analysis reports that integrate text, graphics, and statistical methods to tell a coherent story and answer substantive questions.
To achieve this, students must be able to:
- Derive and compare the statistical properties of regression methods.
- Interpret the statistical meaning of regression results.
- Choose regression procedures that are appropriate for a given dataset and substantive question.
- Select appropriate graphical and statistical diagnostics to verify the assumptions of regression methods.
- Interpret the results of these diagnostics to select appropriate procedures, and explain how any problems affect the interpretation of the results.
- Explain the scientific meaning of regression results in non-statistical terms.
- Write reports that integrate text, graphics, and tables to explain statistical analyses of real data.
- Use R and Quarto to analyze data and construct data analysis reports.
Notice the emphasis on writing, practical data analysis, and modeling data to answer substantive questions. While this course covers the theory of regression and motivates methods statistically, our goal is for you to become well-informed data analysts with a thorough command of the methods you apply to real data. Many parts of this course are designed to prepare Statistics & Data Science PhD students for their first-year Data Analysis Exam.
Textbooks and references
This course website serves as the primary reference for the course. You may find the following additional books to be useful references:
- Sanford Weisberg, Applied Linear Regression, 4th edition (2013). The 4th edition is available using your CMU Library access.
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, 2nd edition (2009). Available electronically through SpringerLink or from the authors’ website (but note the $40 Springer MyCopy is black and white, though many figures are in color!).
- George Seber and Alan Lee, Linear Regression Analysis, 2nd edition (2003). Much more theoretical than Weisberg. Available online through Wiley.
- Ronald Christensen, Plane Answers to Complex Questions, 5th edition (2020). A good geometrically based reference to linear model theory. Available electronically through SpringerLink and in $40 MyCopy paperback.
- David Harville, Matrix Algebra from a Statistician’s Perspective. Available through SpringerLink.
- Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, R for Data Science. Freely available online or can be purchased in print.
You do not need to buy paper copies. I recommend using the electronic versions as needed, both for assigned readings and as a general reference, and if you discover one book is particularly helpful to you, consider buying a copy.
Other readings may be posted on Canvas as needed.
Homework and projects
This course features three major data analysis reports, to be completed individually. In these reports you will practice the skills taught in this class: you will be given a real dataset and several substantive questions about it, and you will examine the data, decide on the best methods to answer the substantive questions, conduct your analysis, and write a report describing your analysis and your results. This is meant to give you realistic practice solving real statistical problems, and particularly to give you practice writing about data analysis for your future research and publications.
These reports are designed to resemble the Data Analysis Exam given to Statistics & Data Science PhD students at the end of their first year.
A rubric for the reports will be posted on Canvas and lists the criteria that define a satisfactory report. Rather than assigning a numerical grade when we review your reports, we’ll give you feedback on how well you met each rubric criterion, and give you the opportunity to revise your report to improve any area that may need more work. See the Grading section below for more details.
Besides the projects, you will also complete regular weekly homework. The homework will include theoretical derivations and proofs, plus practical problems conducting simulations, analyzing data, and exploring particular methods. A rubric posted on Canvas will describe the formatting requirements for homework submissions.
Another key weekly activity will be reading. I will assign readings and a few related questions to be answered through Canvas, so that you will be familiar with new topics before we begin discussing them in class. Most readings will come from the course website, and the rest will be papers or other materials provided on Canvas.
Late work
For homework and data analysis reports, you will have three “grace days” you can use throughout the semester. Each grace day you use on an assignment gives you an extra 24 hours to submit it; you may use more than one per assignment. You do not need any excuse to use grace days. Once you have used all three grace days, late work will not be accepted.
This system is meant to allow you flexibility, so that ordinary problems (minor illness, forgot a deadline, had to finish another class’s big assignment, traveled to an event) don’t harm you, and so you do not need my permission to handle unexpected problems. If you experience a serious emergency that prevents you from completing work for a longer time, contact me so we can make arrangements.
Late reading assignments will not be accepted, since reading assignments are intended to prepare you for a specific day of class.
Schedule
This schedule is approximate and subject to change based on our progress during the semester.
Week | Topics | Chapter | Project |
---|---|---|---|
1 | Causality | 2 Causality | |
2 | Labor Day; Basic regression | 4 Linear Regression Basics | |
3 | Multiple regression | 5 Geometric Multiple Regression, 6 Linear Models in R, 7 Interpreting Regressors | |
4 | Diagnostics; expanding the feature space | 8 The Regressinator, 9 Regression Assumptions and Diagnostics, 10 Nonlinear Regressors | |
5 | Inference and report-writing | 11 Conducting Inference, 25 Genre Conventions | Project 1 assigned |
6 | Logistic regression | 12 Logistic Regression | Project 1 due |
7 | GLMs | 13 Other Response Distributions | Project 1 peer reviews due |
8 | Fall Break | | |
9 | Bootstrapping; additive models | 15 The Bootstrap, 14 Generalized Additive Models | Project 1 revisions due |
10 | GAMs; predictive modeling | 17 Prediction Goals and Prediction Errors, 18 Estimating Error | Project 2 assigned |
11 | Penalized regression | 19 Penalized Models | Project 2 due |
12 | Kernel smoothing | 20 Kernel Regression | Project 2 peer reviews due |
13 | Hierarchical models | 23 Mixed and Hierarchical Models | Project 2 revisions due |
14 | Experimental design; Thanksgiving | 24 Experimental Design | |
15 | Missing data; data ethics | 22 Missing Data | Project 3 assigned |
Attendance and participation
Class attendance and participation are essential. If there’s any one message to be learned from pedagogical research, it’s that listening passively to a lecture is not an effective way to learn how to think about complicated problems. As a result, we will use much of our class time for demonstrations and activities, such as
- practicing new data analysis techniques with real data in R
- running simulations to validate tests or diagnostics
- conducting peer review of data analysis reports
- examining data case studies to determine the appropriate statistical methods to solve real problems.
You are expected to attend class and participate in these activities. Many of the activities will be expanded upon in homework assignments and submitted for homework credit.
You should bring your laptop to class, as some activities will involve using R for data analysis. But please do not distract your classmates by using your laptop in class to do things unrelated to the class, however tempting it may be.
If you cannot attend a class for any reason, please let me know as far in advance as possible. Class sessions are not recorded, and remote attendance of in-person classes is not possible.
Grading
Homework
Each homework problem will be graded on a simple 2-point scale. 2 = satisfactory, perhaps with small flaws that do not affect the conclusion; 1 = satisfactory, but with flaws that do affect the conclusion; 0 = missing or does not address the problem.
Readings
Each assigned reading will be accompanied by a few short questions on Canvas. These will be graded on completion: any good-faith effort to answer the questions will receive 1 point.
Projects
Following the rubric, a project report will be graded Satisfactory if, after revision, it meets all but at most three of the rubric criteria. Reports that meet all criteria at a high standard of quality will be graded Excellent.
Attendance
Grades can be adjusted up or down one grade level based on class participation, at the instructor’s discretion. Typically this will only be used in extraordinary cases, such as to penalize a failure to attend most sessions or a refusal to participate in class activities.
Final grades
Grades will be assigned according to a table. The homework and reading grades will be averaged, putting 20% of the weight on the readings. You will earn the highest letter grade for which you meet both criteria:
Grade average | Reports | Grade |
---|---|---|
> 85% | 3 excellent | A+ |
> 80% | 2 excellent, 1 satisfactory | A |
> 75% | 1 excellent, 2 satisfactory | B |
> 60% | 3 satisfactory | C |
< 60% | 3 satisfactory | D |
< 60% | < 3 satisfactory | R |
If you have concerns about how any of your work was graded, please discuss your concerns with me within two weeks of the graded work being returned to you.
The thresholds in the grade average column may be adjusted at the instructor’s discretion, but only in your favor.
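As an unofficial illustration of the rule in the table above, the following Python sketch computes the weighted average and looks up the letter grade. It rests on several assumptions that the table does not spell out: homework and reading scores are expressed as percentages, the thresholds are strict inequalities exactly as written, a report graded Excellent also counts as Satisfactory, and any combination the table does not cover falls through to D or R. The function names are hypothetical, and the instructor’s discretionary adjustments are not modeled.

```python
# Unofficial sketch of the final-grade table; all names here are hypothetical.

def weighted_average(homework_pct, reading_pct, reading_weight=0.20):
    """Average homework and reading percentages, with 20% weight on readings."""
    return (1.0 - reading_weight) * homework_pct + reading_weight * reading_pct


# Rows of the table, from highest grade to lowest:
# (average must exceed this, minimum Excellent reports, grade)
GRADE_ROWS = [
    (85.0, 3, "A+"),
    (80.0, 2, "A"),
    (75.0, 1, "B"),
    (60.0, 0, "C"),
]


def letter_grade(average_pct, n_excellent, n_satisfactory):
    """Return the highest grade for which both table criteria are met.

    n_satisfactory counts reports graded Satisfactory but not Excellent.
    Assumption: an Excellent report also satisfies a "satisfactory" requirement.
    """
    at_least_satisfactory = n_excellent + n_satisfactory
    for threshold, min_excellent, grade in GRADE_ROWS:
        if (average_pct > threshold
                and n_excellent >= min_excellent
                and at_least_satisfactory >= 3):
            return grade
    # Assumption: cases not listed in the table fall through to D or R.
    return "D" if at_least_satisfactory >= 3 else "R"


if __name__ == "__main__":
    avg = weighted_average(homework_pct=90.0, reading_pct=100.0)
    print(round(avg, 1), letter_grade(avg, n_excellent=2, n_satisfactory=1))
```

For instance, under these assumptions a student with a 90% homework average, a 100% reading average, two Excellent reports, and one Satisfactory report meets the average criterion for the top row but the report criterion only for the second row, so the lookup returns the second row’s grade.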
Note that in Dietrich College, the minimum passing grade in Ph.D. courses is B-. If you elect to take the course pass/fail, the Registrar will automatically convert grades below B- to N (no credit) when grades are posted.
Academic integrity
Collaboration
Discussing homework and projects with your classmates is allowed and encouraged, and helping explain ideas to each other is a core part of the academic experience. But it is important that every student get practice working on their own. This means that all the work you turn in must be your own. You must devise and write your own code, generate your own graphics, and write your own solutions and reports.
Outside sources
You may use external sources (books, websites, papers) to
- Look up R documentation, find useful packages, find explanations for error messages, or remind yourself about the functions to fit some model,
- Find reference materials on statistical methods,
- Clarify material from the course notes or examples.
But external sources must be used to support your work, not to obtain your work. You may not use them to copy code, text, or graphics without attribution. You may not use any prior course’s or textbook’s homework solutions in any way. This prohibition applies even to students who are re-taking the course. Do not copy old solutions (in whole or in part), and do not “consult” or read them. Doing any of that is cheating, making any feedback you get meaningless and any evaluation based on that assignment unfair.
If you do use any material from other sources, you must clearly mark its source. Text taken from other sources must be in quotation marks with citations; figures from other sources need a caption indicating the source; and code from other sources must have a comment indicating the source. We must be able to determine who wrote any material you submit, and you must not falsely imply that you completed work actually done by others.
Generative AI
Some of you may be tempted to use generative AI tools like ChatGPT, Gemini, Llama, or Claude to complete some of your work in this course. My policy towards these tools depends on the type of assignment and its learning goals:
- Reading questions: Their purpose is to encourage you to think about the reading and to elicit your thoughts and questions. Using generative AI would defeat the purpose, so generative AI is not permitted.
- Homework: Homework enables your reflection on the course concepts and helps you recognize when you do not understand something. Using generative AI would defeat the purpose, so generative AI is not permitted.
- Projects: Projects help you practice regression, data analysis, writing, and coding skills. Some of these skills can be assisted by generative AI: for example, it can help debug code or polish text, and can be good at this. You may consult generative AI tools, but you must still take intellectual responsibility for your code, analysis, and writing, and be able to explain and defend every decision. You may not use generative AI to come up with analysis or interpretation for you, only to augment your own work. If you use a generative AI tool, list it as a “coauthor” in your submission to disclose your use.
However, generative AI tools are likely not as good as you think they are. They write grammatically correct text, but they do not write about data analysis like humans do, and their style is very distinct (even if prompted to write in a certain way). If you use generative AI to help write reports, you will likely have to make extensive revisions before your report is graded Satisfactory.
I will consider the use of generative AI tools on assignments where they are not permitted to be “unauthorized assistance”, as defined in the University Policy on Academic Integrity.
Penalties
Please talk to me if you have any questions about this policy. Any form of cheating, unauthorized assistance, or plagiarism is grounds for sanctions to be determined by the instructor, including grade penalties or course failure. Students taking the course pass/fail may have this status revoked. I am also obliged in these situations to report the incident to your academic program and the appropriate University authorities. Please refer to the University Policy on Academic Integrity.
Accommodations for students with disabilities
If you have a disability and have an accommodations letter from the Disability Resources office, I encourage you to discuss your accommodations and needs with me as early in the semester as possible. I will work with you to ensure that accommodations are provided as appropriate. If you suspect that you may have a disability and would benefit from accommodations but are not yet registered with the Office of Disability Resources, I encourage you to contact them at access@andrew.cmu.edu.
Diversity and inclusion
We must treat every individual with respect. We are diverse in many ways, and this diversity is fundamental to building and maintaining an equitable and inclusive campus community. Diversity can refer to multiple ways that we identify ourselves, including but not limited to race, color, national origin, language, sex, disability, age, sexual orientation, gender identity, religion, creed, ancestry, belief, veteran status, or genetic information. Each of these diverse identities, along with many others not mentioned here, shapes the perspectives our students, faculty, and staff bring to our campus. We, at CMU, will work to promote diversity, equity and inclusion not only because diversity fuels excellence and innovation, but because we want to pursue justice. We acknowledge our imperfections while we also fully commit to the work, inside and outside of our classrooms, of building and sustaining a campus community that increasingly embraces these core values.
Each of us is responsible for creating a safer, more inclusive environment.
Unfortunately, incidents of bias or discrimination do occur, whether intentional or unintentional. They contribute to creating an unwelcoming environment for individuals and groups at the university. Therefore, the university encourages anyone who experiences or observes unfair or hostile treatment on the basis of identity to speak out for justice and support, within the moment of the incident or after the incident has passed. Anyone can share these experiences using the following resources:
- Center for Student Diversity and Inclusion: csdi@andrew.cmu.edu, (412) 268-2150
- Report-It online anonymous reporting platform. Username: tartans; password: plaid
All reports will be documented and deliberated to determine if there should be any follow-up actions. Regardless of incident type, the university will use all shared experiences to transform our campus climate to be more equitable and just.
Wellness
All of us benefit from support during times of struggle. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is almost always helpful.
If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 or visit their website. Consider reaching out to a friend, faculty or family member you trust for help getting connected to the support that can help.