Data analysis (Software Engineering) 2018

Class email: cshse.ml@gmail.com
Anonymous feedback form: here
Scores
Previous Course Page
Course repo
Telegram Group

Contents

1 Course description
2 Final Exam
3 Colloquium
4 Kaggle
5 Lecture materials
6 Seminars
7 Evaluation criteria
8 Plagiarism
9 Deadlines
10 Structure of emails and homework submissions
11 Useful links

Course description

In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, and collaborative filtering. We will also study the mathematical methods and concepts that data analysis is based on, the formal assumptions behind them, and various aspects of their implementation.

Significant attention is given to practical data analysis skills, which will be developed in seminars by studying the Python programming language and relevant libraries for scientific computing.

Knowledge of linear algebra, real analysis, and probability theory is required.

The class consists of:

Lectures and seminars
Practical and theoretical homework assignments
A machine learning competition (more information will be available later)
Midterm theoretical colloquium
Final exam

Final Exam

The final exam will be held on the 22nd of June at 10:30 in room 301.

The question list is available here.

Colloquium

The colloquium will be held on the 6th of April.

You may not use any materials during the colloquium except a single A4 sheet, prepared before the colloquium and handwritten personally by you (both sides may be used). You will get 2 questions from the question list with 20 minutes for preparation, and you may receive additional questions or tasks.

Kaggle

Link to competition

Send your presentation before June 3, 23:59. Presentations should be submitted here. The title page of the presentation should list all team participants.

The presentation should be in pdf or ppt format and contain all components listed in the competition rules. Code in py or ipynb format should also be attached to the email (it may consist of several files).

Lecture materials

Lecture 1. Introduction to data science and machine learning.
Slides
Additional materials: 1, 2

Lecture 2. Metric methods.
Slides

Lecture 3. Decision trees.
Slides

Lecture 4. Linear regression. Gradient descent.
Slides

Lecture 5. Linear classification. Logistic Regression
Slides

Lecture 6. Supervised learning quality measures
Slides

Lecture 7. Support Vector Machines. Kernel Trick
Slides

Lecture 8. Dim. Reduction. PCA, t-SNE
Slides

Lecture 9. Ensembles. Bagging, Stacking, Blending
Slides

Lecture 10. Ensembles. Boosting
Slides

Lecture 11. Neural Networks 1
Slides

Lecture 12. Neural Networks 2
Slides

Lecture 13. Clustering
Slides

Lecture 14. Clustering 2
Slides

Lecture 15. Intro to Recsys
Slides

Seminars

Seminar 1. Introduction to Data Analysis in Python
Python Intro, NumPy Tutorial, Pandas Tutorial
Complete the tutorials by the 28th of January 23:59 and submit an archive (with your name on it) to this link

Seminar 2. Metric Methods
Practice in class, partially filled
Theoretical task 1. Due date: February 2 23:59

Seminar 3. Decision Trees
Practice in class, partially filled
Theoretical task 2. Due date: February **10** 23:59
Practical task 2, dataset. Due date: February 17 23:59

Seminar 4. Linear regression
Practice in class, partially filled
Theoretical task 3. Due date: February 17 23:59

Seminar 5. Linear classification
Practice in class, partially filled
Theoretical task 4. Due date: February 25 23:59
Practical task 3, first_dataset, diabetes. Due date: March 11 23:59

Seminar 6. Quality measures
Practice in class, partially filled
Theoretical task 5. Due date: March 11 23:59

Seminar 7. SVM
Practice in class, partially filled
Theoretical task 6. Due date: March 25 23:59

Seminar 8.1. Feature selection
Practice in class, partially filled
No theoretical task

Seminar 8.2. PCA, t-SNE
Practice in class, partially filled
Theoretical task 7. Due date: April 23 23:59

Seminar 9. Ensembles. Bagging, Stacking
Practice in class, partially filled
Practical task 4. Due date: April 29 23:59. Link to load solution

Seminar 10. Ensembles. Boosting
Practice in class, partially filled
Theoretical task 8. Due date: May 13 23:59
Practical task 5. Due date: May 20 23:59. Data file

Seminar 11. Neural Networks 1
Practice in class, partially filled
Theoretical task 9. Due date: May 20 23:59

Seminar 12. Neural Networks 2
Practice in class

Seminar 13. Clustering
Practice in class, partially filled
Theoretical task 10. Due date: June 12 23:59

Seminar 14. Kaggle Presentations

Seminar 15. Intro to recsys
Practice in class, partially filled

Evaluation criteria

The course runs during the 3rd and 4th modules. Students' knowledge is assessed through their home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams during the course, one after the 3rd module and one after the 4th module. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.

The grade takes values 4, 5, …, 10. Grades of 1, 2 and 3 are considered unsatisfactory. Exact grades are calculated using the following rule:

score ≥ 35% => 4,
score ≥ 45% => 5,
...
score ≥ 95% => 10,

where score is calculated using the following rule:

score = 0.7 * S_cumulative + 0.3 * S_exam2
S_cumulative = 0.8 * S_homework + 0.2 * S_exam1 + 0.2 * S_competition

S_homework – proportion of correctly solved homework,
S_exam1 – proportion of successfully answered theoretical questions during exam after module 3,
S_exam2 – proportion of successfully answered theoretical questions during exam after module 4,
S_competition – score for the competition in machine learning (it's also from 0 to 1).

Participation in the machine learning competition is optional and can give students extra points.
An "automatic" pass of the course based on the cumulative score may be granted.
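
The grading rule above can be sketched in Python as follows. This is only an illustration, not an official calculator: the function names are hypothetical, and the thresholds between 45% and 95% are an assumption (evenly spaced 10-point steps), since the rule above lists only the first two and the last one. The competition score defaults to 0 because participation is optional.

```python
# Thresholds (as fractions of 1) mapped to grades, highest first.
# Only 35% -> 4, 45% -> 5 and 95% -> 10 are stated on this page;
# the intermediate steps (55% -> 6, ..., 85% -> 9) are ASSUMED.
GRADE_THRESHOLDS = [
    (0.95, 10), (0.85, 9), (0.75, 8), (0.65, 7),
    (0.55, 6), (0.45, 5), (0.35, 4),
]


def cumulative_score(s_homework, s_exam1, s_competition=0.0):
    """S_cumulative; the weights sum to 1.2 because the competition is a bonus."""
    return 0.8 * s_homework + 0.2 * s_exam1 + 0.2 * s_competition


def final_score(s_cumulative, s_exam2):
    """Overall score used to look up the grade."""
    return 0.7 * s_cumulative + 0.3 * s_exam2


def grade(score):
    """Return the grade (4..10), or None if the score is below 35%."""
    for threshold, value in GRADE_THRESHOLDS:
        if score >= threshold:
            return value
    return None  # unsatisfactory; the page does not say how 1, 2, 3 are assigned


if __name__ == "__main__":
    s_cum = cumulative_score(s_homework=0.9, s_exam1=0.8)
    print(grade(final_score(s_cum, s_exam2=0.7)))
```

For example, under these assumptions a student with 90% of the homework solved, 80% on the first exam, 70% on the second exam, and no competition entry has a cumulative score of 0.88 and an overall score of about 0.83.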

Plagiarism

If plagiarism is discovered, zero points will be given for the home assignments, for both works that were found to be identical. In case of repeated plagiarism by the same person, a report will be made to the dean.

Deadlines

Assignments sent after the late deadline will not be scored (they receive zero points) unless there is a legitimate reason for the late submission; a high workload in other classes is not a legitimate reason.

Structure of emails and homework submissions

Practical assignments must be submitted in Jupyter notebook format, theoretical ones in pdf. Practical assignments must use Python 2 (or be Python 2 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.

Assignments can be performed in either Russian or English.

Assignments can be submitted only once!

Useful links

Machine learning

Python

Official website
Libraries: NumPy, Pandas, SciKit-Learn, Matplotlib.
A short example for beginners: a brief guide to Python 2 with examples
Python from scratch: A Crash Course in Python for Scientists
Lectures Scientific Python
A book: Wes McKinney «Python for Data Analysis»
A collection of interesting IPython notebooks

Python installation and configuration

anaconda
