Data analysis (Software Engineering)
Scores and deadlines: here
Class email: cshse.ml@gmail.com
Anonymous feedback form: leave a comment or suggestion about the course
Class description
This class covers the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, and collaborative filtering. We will also study the mathematical methods and concepts that data analysis is based on, the formal assumptions behind them, and various aspects of their implementation.
Significant attention is given to practical data analysis skills, which will be developed in seminars by studying the Python programming language and the relevant libraries for scientific computing.
Knowledge of linear algebra, real analysis, and probability theory is required.
The class consists of:
- Lectures and seminars
- Practical and theoretical homework assignments
- A machine learning competition (more information will be available later)
- Theoretical colloquiums: midterm and final
- Final written exam
Class program
- Introduction to machine learning.
- K-nearest neighbours classification and regression. Extensions. Optimization techniques.
- Decision tree methods.
- Bayesian decision theory. Model evaluation.
- Linear classification methods. Adding regularization to linear methods.
- Regression.
- Kernel generalization of standard methods.
- Neural networks.
- Ensemble methods: bagging, boosting, etc.
- Feature selection.
- Feature extraction.
- EM algorithm. Density estimation using mixtures.
- Clustering.
- Collaborative filtering.
- Ranking.
Lecture materials
Lecture 1. Introduction to data science and machine learning.
Additional materials: The Field Guide to Data Science, lecture by K. V. Vorontsov
Lecture 2. K nearest neighbours method.
Additional materials: lecture by K. V. Vorontsov, Metric learning survey
Lecture 3. Decision trees.
Additional materials: Webb, Copsey "Statistical Pattern Recognition", chapter 7.2.
Lecture 4a. Model evaluation.
Additional materials: Webb, Copsey "Statistical Pattern Recognition", chapter 9.
Lecture 4b. Bayes minimum cost classification.
Lecture 5. Linear classifiers.
Additional materials: lectures by K. V. Vorontsov on linear classification methods
Lecture 6. Support vector machines.
Lecture 7. Kernel trick.
Seminars
Seminar 1. Introduction to Data Analysis in Python
Seminar 2. kNN
Theoretical task 2, Practical task 2, data
Additional materials: Visualization tutorial
Seminar 3. Decision trees
Seminar 4. Linear classifiers
Theoretical task 4, Practical task 4, first dataset, diabetes dataset
The deadline for this practical task has changed for some groups. Check it in the table!
Seminar 5. Model evaluation
Course assessment and grading criteria
Course grade. The final course grade is made up of the grades for the homework assignments, the colloquiums, and the exam. The grade for the competition assignment counts as a bonus. The exact grading criteria will be published later.
By default, practical assignments are graded on a 5-point scale and theoretical ones on a 3-point scale.
Plagiarism. Everyone whose work is found to contain plagiarism receives 0 points and a plagiarism mark. This applies both to those who copied and to those whose work was copied. We will not try to identify the original source of the work. You should also understand that plagiarism will have other consequences. If plagiarism is detected for the same person more than once, a report will be filed with the dean.
Deadlines
All the deadlines can be found in the second tab here.
There are two deadlines for each assignment: normal and late. An assignment sent before the normal deadline is scored with no penalty. For assignments sent between the normal and the late deadline, the maximum score is reduced by 50%. Assignments sent after the late deadline will not be scored (they receive a zero score) unless there is a legitimate reason for the late submission; a high load in other classes is not a legitimate reason.
The standard period for working on a homework assignment is 2 weeks to the normal deadline and 4 weeks to the late deadline for practical assignments, and 1 and 2 weeks respectively for theoretical ones. The first practical assignment is an exception.
Deadline time: 23:59 on the day before the seminar (Sunday for students attending Monday seminars and Wednesday for students attending Thursday seminars).
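The penalty rule above can be sketched as a small function. This is only an illustration: the function name `max_score_fraction` and the example dates are hypothetical, and the sketch assumes that a submission made exactly at a deadline still counts as on time for it. Legitimate reasons for late submission are not modelled.

```python
from datetime import datetime


def max_score_fraction(submitted, normal_deadline, late_deadline):
    """Return the fraction of the maximum score that is still attainable.

    Hypothetical helper: assumes a submission made exactly at a deadline
    counts as being on time for that deadline.
    """
    if submitted <= normal_deadline:
        return 1.0  # sent before the normal deadline: no penalty
    if submitted <= late_deadline:
        return 0.5  # between the deadlines: maximum score reduced by 50%
    return 0.0      # after the late deadline: not scored


# Illustrative dates only: a practical assignment whose normal and late
# deadlines are two weeks apart, both at 23:59.
normal = datetime(2015, 2, 15, 23, 59)
late = datetime(2015, 3, 1, 23, 59)

print(max_score_fraction(datetime(2015, 2, 10, 12, 0), normal, late))
print(max_score_fraction(datetime(2015, 2, 20, 12, 0), normal, late))
print(max_score_fraction(datetime(2015, 3, 2, 0, 0), normal, late))
```

The three calls correspond to an on-time submission, a late submission, and a submission after the late deadline.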
Structure of emails and homework submissions
All questions and submissions must be sent to cshse.ml@gmail.com. Use the following subject lines:
- For questions (general, regarding assignments, etc): "Question - Surname Name - Group(subgroup)"
- For homework submissions: "Practice/Theory {Lab number} - Surname Name - Group(subgroup)"
Example: Practice 1 - Ivanov Ivan - 131(1)
If you want to address a particular teacher, add the teacher's name to the subject.
Example: Question - Ivanov Ivan - 131(1) - Ekaterina
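For those who prefer to generate subject lines rather than type them, the format above can be sketched in Python. The helper `make_subject` and all of its parameter names are hypothetical and are not part of any course tooling; the sketch only reproduces the pattern shown in the examples.

```python
def make_subject(kind, surname, name, group, subgroup, lab=None, teacher=None):
    """Build an email subject line in the format described above.

    Hypothetical helper. `kind` is "Question", "Practice" or "Theory";
    `lab` is required for homework submissions; `teacher` is optional.
    """
    if kind == "Question":
        head = kind
    elif kind in ("Practice", "Theory"):
        if lab is None:
            raise ValueError("homework submissions need a lab number")
        head = "{} {}".format(kind, lab)
    else:
        raise ValueError("unknown kind: {!r}".format(kind))

    parts = [
        head,
        "{} {}".format(surname, name),
        "{}({})".format(group, subgroup),
    ]
    if teacher is not None:
        parts.append(teacher)  # address a particular teacher
    return " - ".join(parts)


print(make_subject("Practice", "Ivanov", "Ivan", 131, 1, lab=1))
print(make_subject("Question", "Ivanov", "Ivan", 131, 1, teacher="Ekaterina"))
```

The two calls reproduce the two example subjects given above.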
Please do not mix different topics, such as theoretical and practical assignments, in a single email. When replying, please stay in the same thread (i.e. reply to the same email).
Practical assignments must be submitted in IPython notebook format and theoretical ones in PDF. Practical assignments must use Python 2.7. Use your surname as the filename (e.g. Ivanov.ipynb). Do not archive your assignments.
Assignments can be completed in either Russian or English.
Assignments can be submitted only once!
Useful links
Machine learning
- machinelearning.ru
- Video lectures by K. Vorontsov on machine learning
- One of the classic ML books: The Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)
Python
- Official website
- Libraries: NumPy, pandas, scikit-learn, Matplotlib.
- A short example for beginners: a brief guide to Python 2 with examples
- Python from scratch: A Crash Course in Python for Scientists
- Lectures: Scientific Python
- A book: Wes McKinney, "Python for Data Analysis"
- A collection of interesting IPython notebooks