Bootcamp

The core of SSI is a 6-week summer Bootcamp that gives students the foundations to conduct data science research. The Bootcamp involves approximately 20 hours of work per week. Each course has daily lectures and weekly homework assignments. Students also have access to teaching assistants in review sessions and to discussion boards, open 24/7, for asking questions.

Conducting Data Science Research

How to design and conduct a data science project

This course is designed to teach high school students how to complete data science research projects from beginning to end. Students will learn how to define a problem in a field of interest, identify relevant datasets, and formulate a literature-supported research proposal. In addition, students will be taught the modern algorithmic and visualization toolbox for data science projects. Finally, students will be taught how to write publishable research papers, create research posters, and deliver scientific presentations. There will be 5 lectures a week and weekly homework assignments. The teaching team for the course is excited to be designing an innovative curriculum based on their personal experiences with research and high school science fairs and competitions. Course homework assignments will incorporate data science project resources being developed by Eric Zhang, a 2x gold medalist at the International Olympiad in Informatics (IOI).

Franklyn Wang

Course Instructor

Franklyn is a student at Harvard, where he is a teaching assistant for graduate-level probability courses. Franklyn mentored the student who placed 3rd in the nation at the 2019 Science Talent Search (STS). In high school, Franklyn was named a Davidson Fellow (top 12 nationally; $25,000 scholarship), a Regeneron Science Talent Search (STS) Finalist (top 40 nationally; $25,000 scholarship), and a Siemens 2nd Place National Winner ($50,000 scholarship). In addition, Franklyn placed top 5 nationally in the USA Computing Olympiad, top 20 nationally in the USA Math Olympiad, and top 50 nationally in the USA Physics Olympiad. In college, he was named N2 on the Putnam (Top 20). In 2019 and 2020 he was on the Harvard ICPC Team (Top 3 at school) and finished 2nd and 3rd in North America, respectively. In 2020, he was named a Goldwater Scholar, the most prestigious fellowship in the natural sciences, mathematics, and engineering. He is also the primary author of a paper published in Operations Research Letters.

Anne Lee

Assistant Instructor

Anne is a student studying computer science and sustainability at Stanford. In high school, Anne was invited to attend the Research Science Institute (RSI), a highly selective high school research fellowship hosted by MIT (top 80 of 1600). At RSI, Anne conducted research at MIT's Computational Materials Design Lab and was the only student in her RSI class to receive top awards for both paper and presentation. In college, her project simulating species extinctions was recognized as the top project in Stanford's CS109 (Probability for Computer Scientists) out of 100+ project submissions. She also received the energyCatalyst grant from the Stanford TomKat Center for Sustainable Energy to pursue research in monitoring coral reef health through computer vision.

Dhruvik Parikh

Assistant Instructor

Dhruvik is a student studying computer science and economics at Stanford. In high school, Dhruvik was recognized as the Young Scientist at the 2018 International Science and Engineering Fair (ISEF; top 3 overall). He placed second nationally at the National Junior Science and Humanities Symposium (JSHS) and has also been named a Forbes 30 Under 30 recipient. At Stanford, Dhruvik has conducted machine learning research at the Sustainability and Artificial Intelligence Laboratory. He has also been a software engineer at Voya Sol and Microsoft. Previously, Dhruvik conducted chemical engineering research at the MIT Hamel Lab and computational biology research at the University of Washington.

Naveen Durvasula

Head Teaching Assistant

Naveen is a student in UC Berkeley’s Management, Entrepreneurship, and Technology program, pursuing a dual degree in EECS and Business Administration. In high school, Naveen was a Research Science Institute (RSI) Scholar and received the 2018-2019 ACM Cutler-Bell Prize in High School Computing. Naveen has also received awards at the Intel International Science and Engineering Fair, the Regeneron Science Talent Search, and the National Junior Science and Humanities Symposium. To date, Naveen has authored or coauthored six papers in a diverse set of applied and theoretical fields, including optimization, topology, algorithms, reinforcement learning, mathematical economics, and stochastic processes. His work has been invited to three conference venues.

Course Outline

Week 1: Defining a Research Problem

1.1: What is Data Science Research, Choosing a Field, Starting With the Question and Not the Data
1.2: Case Studies on Asking Research Questions
1.3: Immersing in a Field; Narrowing an Area of Interest
1.4: Case Studies on Conducting Background Research
1.5: How to Read Scientific Literature (Finding Relevant Journals, Dissecting Articles, Dealing With Scientific Complexity)

Week 2: Identifying Datasets and Writing a Research Proposal

2.1: Finding Data for a Problem, Surveying Data, Properties of Good and Bad Data, Creating Our Own Data, Cleaning Data, What To Do When We Can’t Find Good Data
2.2: Case Studies on Finding Datasets
2.3: Can Our Data Answer Our Question; All Research Projects Say “What Can Be Learned From This Data”, Combining Datasets, Types of Questions (Prediction vs. Inference and Causality)
2.4: Case Studies on Answerable Questions
2.5: Writing a Research Proposal (Introduction, Literature Review, Purpose, Datasets, Methodology)

Week 3: Exploratory Data Analysis and Types of Data Science Research

3.1: Exploratory Data Analysis
3.2: Case Studies on Exploratory Data Analysis
3.3: Commanding the Supervised and Unsupervised Toolbox
3.4: Case Studies in Prediction and Clustering
3.5: Strategies for Data Visualization

Week 4: After the Basics: Advancing your Project

4.1: Correlation Does Not Imply Causation: The Art of Causal Inference
4.2: Statistics Crash Course: Tests of Significance
4.3: More Advanced Machine Learning Methods
4.4: More Advanced Data Visualizations
4.5: (In)Formal Justifications and Mathematical Proofs

Week 5: Writing a Research Paper

5.1: Writing a Research Paper (Part I - Introduction, Purpose, Methodology)
5.2: Case Studies on Research Papers (Part I)
5.3: Writing a Research Paper (Part II - Results, Discussion, Conclusion, and Future Investigation)
5.4: Case Studies on Research Papers (Part II)
5.5: Writing a Research Abstract

Week 6: Research Poster and Presentation

6.1: Science Research Posters
6.2: Case Studies on Science Research Posters
6.3: Oral Presentations
6.4: Case Studies on Oral Presentations
6.5: Science Research Competitions and the Future

Programming for Data Science Research

Teaching the programming skills to conduct interdisciplinary data science research projects

Data science is revolutionizing most sectors of science and technology. In the future, a strong understanding of computer science and data will play an increasingly critical role in making significant breakthroughs in both research and technology. This course will teach students computer programming for data science. The course is applied in focus and geared towards teaching the practical skills for conducting interdisciplinary data science research projects. Most course topics will be taught through examples with real research datasets. There will be 5 lectures a week and weekly homework assignments.

Alex Tsun

Course Instructor

Alex is a Stanford Course Assistant and seasoned instructor who has served as a teaching assistant for "Probability for Computer Scientists" a total of 13 times at UW Seattle and Stanford. He is currently pursuing an M.S. in Computer Science and specializing in AI and theoretical computer science (GPA: 4.06/4.00). As an undergraduate, Alex completed a triple major in computer science, statistics, and mathematics. Alex has worked in the past as a machine learning researcher at LinkedIn, a data scientist at Facebook, and a software engineer at Google.

Adam Pahlavan

Assistant Instructor

Adam is a former Stanford Course Assistant who has held previous jobs in AI research at NVIDIA and in data science at Point72. While an undergraduate at Stanford, Adam studied data science and AI and graduated at the top of his class (GPA: 4.07/4.00; top 8% academic GPA in engineering school). Adam completed his degree at Stanford in 2.5 years. While an undergraduate, he was a course assistant in a graduate-level cryptography course. Adam has done well in some of Stanford's most rigorous graduate-level computer science courses, including Cryptography (top 5%), Machine Learning (top 5%), Convolutional Neural Networks for Visual Recognition (top 10%), and Mining Massive Datasets (top 10%).

Matthew Taing

Head Teaching Assistant

Matthew is a computer science and informatics student at the University of Washington, interested in data science, teaching, and accessibility. He has two years of teaching experience in courses such as data science foundations and advanced data structures. An avid teacher, Matthew has often gone above and beyond to voluntarily hold extra review sessions, develop curriculum materials, and manage infrastructure. Most recently, Matthew volunteered as a section leader for Code in Place, a program led by Stanford to remotely teach introductory Python skills.

Course Outline

Week 1: Introduction to Python

1.1: Introduction to Python, Variables
1.2: For Loops and Nested Loops
1.3: While Loops and If/Else
1.4: Functions
1.5: Challenge Problems

Week 2: Advanced Python

2.1: Lists, List Comprehensions, and Sorting
2.2: Sets and Dictionaries
2.3: Classes
2.4: The Numpy Library for Scientific Computing
2.5: Challenge Problems

Week 3: Data Wrangling and Visualization

3.1: Mathematical Typesetting with LaTeX
3.2: Data Types (Categorical, Continuous), Data Formats (CSV, JSON), and Reading and Writing Data in Python
3.3: The Pandas Library for Manipulating Data
3.4: The Matplotlib Library for Visualization
3.5: Challenge Problems

Week 4: Machine Learning: Unsupervised Learning

4.1: Introduction to Probability
4.2: Introduction to Machine Learning, Motivation, Tasks, Demo
4.3: Dataset Case Study: Unsupervised Learning Tasks, Clustering
4.4: Dataset Case Study: Dimensionality Reduction via Principal Components Analysis
4.5: Challenge Problems

Week 5: Machine Learning: Supervised Learning

5.1: Dataset Case Study: Regression Tasks, Linear Regression
5.2: Dataset Case Study: Linear Regression with Regularization, Polynomial Regression
5.3: Dataset Case Study: Classification Tasks, k-Nearest Neighbors
5.4: Dataset Case Study: Logistic Regression, Support Vector Machines
5.5: Challenge Problems

Week 6: Advanced Topics and Next Steps

6.1: Dataset Case Study: Decision Trees, Ensemble Methods
6.2: Dataset Case Study: Neural Networks and Deep Learning
6.3: Limitations of Machine Learning, Unsolved Problems, Course Wrap-up
6.4: Beyond Data Science: Other Fields of Computer Science
6.5: Paths to Continue Learning Computer Science in High School

Review Sessions

Course instructors and teaching assistants will hold review sessions throughout the week. Students can attend review sessions to ask teaching staff questions relating to lecture or homework assignments. Every week, there will also be review sessions to go over programming and research homework assignments and common questions students have.

Discussion Board

In addition to review sessions, students can reach out to course instructors and teaching assistants through Piazza, a virtual discussion board. Students can ask questions on discussion boards at any time, and a member of the teaching team will respond as soon as possible. Through discussion boards, students can quickly get feedback, ask questions about homework, and receive assistance debugging code.