Data analysis (Software Engineering)
This page is for the year 2016!
Scores and deadlines: here
Class email: cshse.ml@gmail.com
Anonymous overall course evaluation form
Anonymous feedback form: here
Contents
Announcements
Kaggle evaluation
Kaggle evaluation is available here. Please check that your work is in the list. Presentations were evaluated using the rules of the competition. In particular, I was expecting to see:
- that you tried different methods
- table with accuracy results of each method
- a description of how you tuned the parameters of your model (over which grid, with graphs showing how accuracy changed)
- data analysis and insights described with illustrative visualizations.
- in feature selection: quantitative results showing how each feature is or is not helpful.
Exam questions
Exam questions are published and available here.
Kaggle presentation requirements
You should send presentations by June 3 (Friday), 23:59. Presentations should be sent to v.v.kitov@yandex.ru with the subject "HSE kaggle presentation <team name>". On the title page of the presentation you should list all team participants. The presentation should be in pdf or ppt format and contain all components listed in the competition rules. Code in py or ipynb format should also be attached to the email (it may consist of several files).
Early exam
On June 6th, 13:40-16:30, there will be two lessons. They will cover: 1) a consultation before the exam - please read through all the material and come with your questions; 2) presentations by the top-3 kaggle teams of their solutions (15 minutes each). Teams with over 60 submissions are welcome to share their findings in the data - what worked and what did not (10 minutes each). Everybody else is also welcome (but not obliged) to participate with short presentations (5-10 minutes) about interesting findings in the data and non-standard approaches that you tried (not necessarily successful ones).
For your convenience, there will be a possibility to take the data analysis exam early - on June 6th at 16:40. To take the exam early you need to request participation by e-mail to v.v.kitov@yandex.ru. The number of participants is limited. Note that the early exam will be the same as the official exam and the two are mutually exclusive, so you need to choose which exam to take. The exam program will be available soon. The early exam slot is offered for your convenience - to give you the possibility to fully concentrate on preparing for the data analysis exam.
Course description
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We will also study mathematical methods and concepts which data analysis is based on as well as formal assumptions behind them and various aspects of their implementation.
Significant attention is given to the practical skills of data analysis, which will be developed in seminars by studying the Python programming language and relevant libraries for scientific computing.
Knowledge of linear algebra, real analysis and probability theory is required.
The class consists of:
- Lectures and seminars
- Practical and theoretical homework assignments
- A machine learning competition (more information will be available later)
- Midterm theoretical colloquium
- Final exam
Events outside the course
Universal recommendation system of mail.ru
Description of solutions to different competitions on Kaggle
Neural networks adapt videos to the painting style of famous artists.
Syllabus
- Introduction to machine learning.
- K-nearest neighbours classification and regression. Extensions. Optimization techniques.
- Decision tree methods.
- Bayesian decision theory. Model evaluation.
- Linear classification methods. Adding regularization to linear methods.
- Regression.
- Kernel generalization of standard methods.
- Neural networks.
- Ensemble methods: bagging, boosting, etc.
- Feature selection.
- Feature extraction.
- EM algorithm. Density estimation using mixtures.
- Clustering.
- Collaborative filtering.
- Ranking.
Lecture materials
Lecture 1. Introduction to data science and machine learning.
Additional materials: The Field Guide to Data Science, K.V. Vorontsov's lecture
Lecture 2. K nearest neighbours method.
Additional materials: K.V. Vorontsov's lecture, Metric learning survey
Lecture 3. Decision trees.
Additional materials: Webb, Copsey "Statistical Pattern Recognition", chapter 7.2.
Lecture 4a. Model evaluation.
Binary quality measures. ROC curve, AUC.
Additional materials: Webb, Copsey "Statistical Pattern Recognition", chapter 9.
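As a quick illustration of the AUC mentioned above: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted as 1/2). A minimal sketch, with a hypothetical helper name that is not part of the course materials:

```python
def auc_from_scores(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Equivalent to the area under the ROC curve; ties contribute 1/2.
    The O(n*m) pairwise version is fine for illustration, not for big data.
    """
    total = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(scores_pos) * len(scores_neg))
```

A perfect ranker gives AUC 1.0; random scoring gives 0.5 on average.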
Lecture 4b. Bayes minimum cost classification.
Cases of general losses, common within-class losses and 0-1 losses. Gaussian classifier.
Lecture 5. Linear classifiers.
Discriminant functions and their invariance to monotone transformations. Definitions for the multi-class and binary cases.
Additional materials: K.V. Vorontsov's lectures on linear classification methods
Lecture 6. Support vector machines.
Linearly separable and linearly non-separable cases. Equivalent formulation with a loss function. Support vectors and non-informative vectors.
Lecture 7. Kernel trick.
Application of kernel trick to SVM. Gaussian, polynomial kernels.
Lecture 8. Regression.
Linear regression and extensions: weighted regression, robust regression, different loss-functions, regression with non-linear features, locally-constant (Nadaraya-Watson) regression.
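Of the models listed above, locally-constant (Nadaraya-Watson) regression is compact enough to sketch in a few lines; the Gaussian kernel and the default bandwidth here are illustrative assumptions, not the lecture's exact choices:

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=1.0):
    """Locally-constant regression: predict a kernel-weighted average of
    the training targets, with weights decaying with distance to the query."""
    weights = np.exp(-0.5 * ((x_query - x_train) / bandwidth) ** 2)
    return np.sum(weights * y_train) / np.sum(weights)
```

A small bandwidth makes the estimate follow nearby points closely; a large one smooths the prediction toward the global mean of the targets.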
Lecture 9. Boosting.
Forward stagewise additive modelling. AdaBoost. Gradient boosting.
Additional materials:
Hastie, Tibshirani, Friedman "The Elements of Statistical Learning", section 10: Boosting and Additive Trees;
Merkov "Introduction to Statistical Learning Methods", section 4: Linear Combinations of Recognizers.
Lecture 10. Ensemble methods.
Motivation. Bias-variance tradeoff. Bagging, RandomForest, ExtraRandomTrees. Stacking.
Lectures 11, 12. Summary.
Lecture 13. Feature selection.
Lecture 14. Principal components analysis.
Lecture 14. Singular value decomposition.
Download (updated pages 17, 18, 19).
Lecture 15. Working with text.
Lecture 16. Neural networks.
Lecture 17. Parametric distributions.
Lecture 18. Clustering.
Lecture 19. Mixture densities, EM-algorithm (updated).
Lecture 20. Recommender systems.
Lecture 21. Kernel density estimation.
Seminars
Seminar 1. Introduction to Data Analysis in Python
Seminar 2. kNN
Theoretical task 2, Practical task 2, data
Additional materials: Visualization tutorial
Seminar 3. Decision trees
Seminar 4. Linear classifiers
Theoretical task 4, Practical task 4, first dataset, diabetes dataset
UPD: In all parts of practical task 4 you should use the GD and SGD functions that you program in the first part!
The deadline for this practical task has been changed for some groups! Check it in the table!
Seminar 5. Model evaluation
Seminar 6. Bayesian decision rule
Theoretical task 6, Practical task 6, data
Practical task 6 has been extended: the last part is described in more detail, and there are two small corrections in the first part (in bold font). Read it carefully!
The deadline for this practical task has been changed for all groups! Check it in the table!
Seminar 7. SVM and kernel trick
Additional materials: K.V. Vorontsov's lecture on SVM
Seminar 8. Regression
Seminar 9. Boosting
Seminar 10. Ensemble methods
Problem 2: a small typo in the loss-function formula has been corrected.
Seminar 11. Summary
Seminar 12. How to solve practical problems
Dota Competition from the seminar, ipython notebook
Seminar 13. Feature selection
Theoretical task 13, Practical task 13
The practical task has been finalized.
Seminar 14. Feature extraction
You can read about computing PCA through SVD at the end of this paper.
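The connection between PCA and SVD can be shown directly: center the data, take its SVD, and the right singular vectors are the principal axes. A minimal NumPy sketch (the function name and return values are my own, not taken from the paper):

```python
import numpy as np

def pca_via_svd(X, n_components):
    """PCA via SVD of the centered data matrix X_c = U * S * V^T:
    the rows of V^T are the principal directions (unit norm), and the
    singular values give the component variances as S^2 / (n - 1)."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                # principal axes
    projected = np.dot(X_centered, components.T)  # coordinates in the PCA basis
    explained_variance = (S[:n_components] ** 2) / (len(X) - 1)
    return projected, components, explained_variance
```

This route avoids explicitly forming the covariance matrix, which is the usual argument for computing PCA through SVD in practice.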
Seminar 15. Neural networks
Practical task 15, Data, Data in csv format, Censored training set, Theoretical task 15
Additional materials: Backpropagation, PyBrain’s documentation, PyBrain example from the seminar
New data files have been uploaded (there were problems reading the old ones). Therefore the deadline has been changed for some groups! Check it in the table!
If you get a MemoryError, read only part of the training data from the csv files (for example, 30000 objects). You can download the censored training set (see the link above) or use the following code:
import numpy as np
from pybrain.datasets import ClassificationDataSet

# Each csv row is: label, then 784 pixel values; scale pixels to [0, 1].
mnist_train = np.loadtxt('mnist_train.csv', delimiter=',')
train_data = ClassificationDataSet(28 * 28, nb_classes=10)
for i in xrange(len(mnist_train)):
    train_data.appendLinked(mnist_train[i, 1:] / 255., int(mnist_train[i, 0]))
train_data._convertToOneOfMany()

mnist_test = np.loadtxt('mnist_test.csv', delimiter=',')
test_data = ClassificationDataSet(28 * 28, nb_classes=10)
for i in xrange(len(mnist_test)):
    test_data.appendLinked(mnist_test[i, 1:] / 255., int(mnist_test[i, 0]))
test_data._convertToOneOfMany()
Seminar 16. Clustering
Theoretical task 16, Practical task 16, parrots.jpg, grass.jpg
Seminar 17. Clustering, EM-algorithm
Seminar 18. Recommender systems
Theoretical task 18, Practical task 18, data
Additional materials: Factorization Machines
Columns in the data: 0 - user, 1 - item, 2 - rating, 3 - time (you don't need this one).
In the practical task you should train models on the train data (base) and evaluate on the test data.
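As a starting point for the "train on base, evaluate on test" workflow above, a common recommender baseline predicts a rating as the global mean plus user and item biases. A sketch under the column layout described above (user, item, rating); the function names are hypothetical:

```python
import numpy as np

def fit_baseline(train):
    """Bias baseline: rating(u, i) ~ mu + b_u[u] + b_i[i], where mu is the
    global mean rating and the biases are per-user / per-item residuals."""
    users = train[:, 0].astype(int)
    items = train[:, 1].astype(int)
    r = train[:, 2]
    mu = r.mean()
    b_u = np.zeros(users.max() + 1)
    b_i = np.zeros(items.max() + 1)
    for u in range(len(b_u)):
        mask = users == u
        if mask.any():
            b_u[u] = (r[mask] - mu).mean()
    for i in range(len(b_i)):
        mask = items == i
        if mask.any():
            b_i[i] = (r[mask] - mu - b_u[users[mask]]).mean()
    return mu, b_u, b_i

def rmse(mu, b_u, b_i, test):
    """Root-mean-square error of the baseline on (user, item, rating) rows."""
    users = test[:, 0].astype(int)
    items = test[:, 1].astype(int)
    preds = mu + b_u[users] + b_i[items]
    return np.sqrt(np.mean((preds - test[:, 2]) ** 2))
```

Users or items that appear only in the test set would need a fallback to mu; that case is omitted here for brevity.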
Evaluation criteria
The course lasts through the 3rd and 4th modules. Students' knowledge is assessed by evaluating their home assignments and exams. Home assignments are divided into theoretical tasks and practical tasks. There are two exams during the course - after the 3rd module and after the 4th module respectively. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.
The grade takes values 4, 5, …, 10. Grades 1, 2, 3 are considered unsatisfactory. The exact grade is calculated using the following rule:
- score ≥ 35% => 4,
- score ≥ 45% => 5,
- ...
- score ≥ 95% => 10,
where score is calculated using the following rule:
score = 0.6 * S_{homework} + 0.2 * S_{exam1} + 0.2 * S_{exam2} + 0.2 * S_{competition}
- S_{homework} – proportion of correctly solved homework,
- S_{exam1} – proportion of successfully answered theoretical questions during exam after module 3,
- S_{exam2} – proportion of successfully answered theoretical questions during exam after module 4,
- S_{competition} – score for the competition in machine learning (it's also from 0 to 1).
Participation in machine learning competition is optional and can give students extra points.
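The scoring rule above can be put together in a short helper. Note two assumptions: the intermediate thresholds, elided with "..." above, are taken to continue in 10% steps, and the competition term makes the maximum score exceed 1 since it is optional extra credit. This is an illustrative sketch, not an official calculator:

```python
def course_score(homework, exam1, exam2, competition=0.0):
    """All inputs are proportions in [0, 1]; the competition term is the
    optional extra, so the maximum score can exceed 1."""
    return 0.6 * homework + 0.2 * exam1 + 0.2 * exam2 + 0.2 * competition

def final_grade(score):
    """Map the score to the 4..10 grade scale; below 35% is unsatisfactory
    (returned here as 3)."""
    thresholds = [0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]  # assumed 10% steps
    grade = 3
    for g, t in enumerate(thresholds):
        if score >= t:
            grade = g + 4
    return grade
```

For example, full homework and full marks on both exams already reach a score of 1.0, i.e. the top grade, without the competition bonus.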
Plagiarism
In case of discovered plagiarism, zero points will be given for the home assignment - for both works that were found to be identical. In case of repeated plagiarism by the same person, a report will be made to the dean.
Deadlines
All the deadlines can be found in the second tab here.
We have two deadlines for each assignment: normal and late. An assignment sent before the normal deadline is scored with no penalty. The maximum score is penalized by 50% for assignments sent between the normal and the late deadline. Assignments sent after the late deadline will not be scored (they receive a zero score) in the absence of legitimate reasons for late submission, which do not include high load in other classes.
The standard period for working on a homework assignment is 2 and 4 weeks (normal and late deadlines respectively) for practical assignments and 1 and 2 weeks for theoretical ones. The first practical assignment is an exception.
Deadline time: 23:59 of the day before seminar (Sunday for students attending Monday seminars and Wednesday for students that have seminars on Thursday).
Structure of emails and homework submissions
All the questions and submissions must be addressed to cshse.ml@gmail.com. The following subjects must be used:
- For questions (general, regarding assignments, etc): "Question - Surname Name - Group(subgroup)"
- For homework submissions: "Practice/Theory {Lab number} - Surname Name - Group(subgroup)"
Example: Practice 1 - Ivanov Ivan - 131(1)
If you want to address a particular teacher, mention their name in the subject.
Example: Question - Ivanov Ivan - 131(1) - Ekaterina
Please do not mix two different topics in a single email such as theoretical and practical assignments etc. When replying, please use the same thread (i.e. reply to the same email).
Practical assignments must be implemented in ipython notebook format, theoretical ones in pdf. Practical assignments must use Python 2.7. Use your surname as a filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.
Assignments can be performed in either Russian or English.
Assignments can be submitted only once!
Useful links
Machine learning
- machinelearning.ru
- Video-lectures of K. Vorontsov on machine learning
- One of the classic ML books: The Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)
Python
- Official website
- Libraries: NumPy, Pandas, SciKit-Learn, Matplotlib.
- A short example for beginners: a brief guide to Python 2 with examples
- Python from scratch: A Crash Course in Python for Scientists
- Lectures on Scientific Python
- A book: Wes McKinney «Python for Data Analysis»
- A collection of interesting IPython notebooks