Data analysis (Software Engineering)

'''This page is for the year 2016!'''
  
'''Scores and deadlines: [https://drive.google.com/open?id=1TQ97B8rqC7sUxTnCMKXoskgRPXO8rAyWoBWBezY58h4 here]'''
  
'''Class email:''' cshse.ml@gmail.com

'''[https://docs.google.com/forms/d/100_gMWQwp41zpHgKuf3fl3SpFSwNj6ggL13DtnxWEEw/viewform Anonymous overall course evaluation form]'''

'''Anonymous feedback form:''' [http://goo.gl/forms/CT3h4QaMeB here]
 
<br />
 
== Announcements ==

===Kaggle evaluation===

Kaggle evaluation [https://docs.google.com/spreadsheets/d/1h90zLpQ2Q8QD_xQLGJLY1xljnLGF175Re4fTMChHFgA/edit?usp=sharing is available here]. Please check that your work is in the list. Presentations were evaluated using [https://inclass.kaggle.com/c/hse-spring2016-stack-overflow/rules the rules of the competition]. In particular, I was expecting to see:
* that you tried different methods
* a table with the accuracy results of each method
* a description of how you tuned the parameters of your model (over which grid, with graphs showing the accuracy change)
* data analysis and insights described with illustrative visualizations
* for feature selection: quantitative results on how helpful each feature is

===Exam questions===

Exam questions are published and available [https://yadi.sk/i/Gi_T_R4AsD2aU here].

===Kaggle presentation requirements===

You should send presentations before June 3 (Friday), 23:59, to v.v.kitov@yandex.ru. The email subject should be "HSE kaggle presentation <team name>". On the title page of the presentation you should list all team participants. The presentation should be in pdf or ppt format and include all components listed in the [https://inclass.kaggle.com/c/hse-spring2016-stack-overflow/rules competition rules]. Code in py or ipynb format should also be attached to the letter (it may consist of several files).

===Early exam===

On June 6th, 13:40-16:30, there will be two lessons. They will cover:

1) a consultation before the exam. Please read through all the material and come with your questions.

2) presentations by the top-3 kaggle teams of their solutions (15 minutes each). Teams with over 60 submissions are welcome to tell about their findings in the data - what worked and what did not (10 minutes each). Everybody else is also welcome (but not obliged) to participate with short presentations (5-10 minutes) about interesting findings in the data and non-standard approaches that you tried (not necessarily successful).

For your convenience there will be a possibility to take the data analysis exam earlier - on June 6th at 16:40. To take the exam earlier you need to request participation by e-mail to v.v.kitov@yandex.ru. The number of participants is limited. Note that the earlier exam will be the same as the official exam and the two are mutually exclusive, so you need to select which exam to participate in. The exam program will be available soon. The earlier exam is offered for your convenience - to give you the possibility to fully concentrate on preparing for the data analysis exam.

== Course description ==

In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We also study the mathematical methods and concepts that data analysis is based on, the formal assumptions behind them, and various aspects of their implementation.

Significant attention is given to the practical skills of data analysis, which will be developed in seminars using the Python programming language and the relevant libraries for scientific computing.

Knowledge of linear algebra, real analysis and probability theory is required.

'''The class consists of:'''
# Lectures and seminars
# Practical and theoretical homework assignments
# A machine learning competition (more information will be available later)
# A midterm theoretical colloquium
# A final exam

== Events outside the course ==

[https://it.mail.ru/announcements/36/?utm_campaign=newsletter&utm_medium=email&utm_source=newsletter_2742016df Universal recommendation system of mail.ru]
  
[https://www.youtube.com/channel/UCeq6ZIlvC9SVsfhfKnSvM9w Descriptions of solutions to different competitions on Kaggle]

[http://ria.ru/science/20160514/1432666353.html Neural networks adapt videos to the painting style of famous artists]

== Syllabus ==

# Introduction to machine learning.
# K-nearest neighbours classification and regression. Extensions. Optimization techniques.
# Decision tree methods.
# Bayesian decision theory. Model evaluation.
# Linear classification methods. Adding regularization to linear methods.
# Regression.
# Kernel generalization of standard methods.
# Neural networks.
# Ensemble methods: bagging, boosting, etc.
# Feature selection.
# Feature extraction.
# EM algorithm. Density estimation using mixtures.
# Clustering.
# Collaborative filtering.
# Ranking.

== Lecture materials ==

'''Lecture 1. Introduction to data science and machine learning. '''

Download

Additional materials: The Field Guide to Data Science, lecture by K.V. Vorontsov

'''Lecture 2. K nearest neighbours method. '''

Download

Additional materials: lecture by K.V. Vorontsov, Metric learning survey
 
'''Lecture 3. Decision trees. '''

[https://yadi.sk/i/-vPl2vaBqXrt5 Download]
  
 
Additional materials: ''Webb, Copsey "Statistical Pattern Recognition", chapter 7.2.''
  
'''Lecture 4a. Model evaluation. '''

Binary quality measures. ROC curve, AUC.
  
 
[https://yadi.sk/i/7V7U_1QtnfYZQ Download]
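
As a toy illustration of the AUC measure from this lecture, it can be computed with scikit-learn (the labels and scores below are made up):

 # AUC of a scoring classifier on six objects (Python 2.7).
 from sklearn.metrics import roc_auc_score
 
 y_true = [0, 0, 1, 1, 0, 1]                 # true binary labels
 y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # classifier scores
 print roc_auc_score(y_true, y_score)        # area under the ROC curve, ~0.89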
  
 
Additional materials: ''Webb, Copsey "Statistical Pattern Recognition", chapter 9.''

'''Lecture 4b. Bayes minimum cost classification. '''

The case of general losses, common within-class losses and 0-1 losses. The Gaussian classifier.

[https://yadi.sk/i/6jIvLiYMosuz5 Download]
  
 
'''Lecture 5. Linear classifiers. '''
 
'''Lecture 5. Linear classifiers. '''
  
[https://yadi.sk/i/E1MY5e4JnfYf4 Download]
+
Discriminant function. Invariance to monotonous transformations for them. Definition for multi-class and binary class cases.
 +
 
 +
[https://yadi.sk/i/0IQ6P3LDoqpRk Download]
  
 
Additional materials: [http://www.machinelearning.ru/wiki/images/5/53/Voron-ML-Lin-SG.pdf Lectures by K.V. Vorontsov on linear classification methods]

'''Lecture 6. Support vector machines. '''

The linearly separable and linearly non-separable cases. An equivalent definition via a loss function. Support vectors and non-informative vectors.

[https://yadi.sk/i/baP7gbWXoqpTE Download]

'''Lecture 7. Kernel trick. '''

Application of the kernel trick to SVM. Gaussian and polynomial kernels.

[https://yadi.sk/i/3W6A9FmZoqpU9 Download]

'''Lecture 8. Regression. '''

Linear regression and extensions: weighted regression, robust regression, different loss functions, regression with non-linear features, locally-constant (Nadaraya-Watson) regression.

[https://yadi.sk/i/HSp51pmepjQBq Download]
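
To give a feel for the locally-constant regression mentioned above, here is a toy numpy sketch of the Nadaraya-Watson estimator with a Gaussian kernel (not taken from the lecture materials; h is the bandwidth):

 # Nadaraya-Watson regression: the prediction is a kernel-weighted average of y (Python 2.7).
 import numpy as np
 
 def nadaraya_watson(x_train, y_train, x_query, h=1.0):
     # Gaussian kernel weight of every training point for each query point.
     w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / h) ** 2)
     return (w * y_train).sum(axis=1) / w.sum(axis=1)
 
 x = np.linspace(0, 10, 50)
 y = np.sin(x)
 print nadaraya_watson(x, y, np.array([2.0, 5.0]), h=0.5)  # close to sin(2), sin(5)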

'''Lecture 9. Boosting. '''

Forward stagewise additive modelling. AdaBoost. Gradient boosting.

[https://yadi.sk/i/3yTKLDcCpjLhG Download]

Additional materials:

[http://statweb.stanford.edu/~tibs/ElemStatLearn/ Friedman, Hastie, Tibshirani "The Elements of Statistical Learning"] - section 10: Boosting and additive trees,

[http://www.recognition.mccme.ru/pub/RecognitionLab.html/slbook.pdf Merkov "Introduction to Statistical Learning Methods"] - section 4: Linear combinations of classifiers.

'''Lecture 10. Ensemble methods. '''

Motivation. Bias-variance tradeoff. Bagging, RandomForest, ExtraRandomTrees. Stacking.

[https://yadi.sk/i/8E-wZxIpq9Sgf Download]

'''Lectures 11, 12. Summary. '''

'''Lecture 13. Feature selection. '''

[https://yadi.sk/i/2ZLC3J6dr3iAc Download]

'''Lecture 14. Principal components analysis. '''

[https://yadi.sk/i/6DxoScrKrN3LK Download]

'''Lecture 14. Singular value decomposition. '''

[https://yadi.sk/i/zwJftdkUrN6Td Download] - updated pages 17-19.

'''Lecture 15. Working with text. '''

[https://yadi.sk/i/YqdRr-0erEXba Download]

'''Lecture 16. Neural networks. '''

[https://yadi.sk/i/9yMM7mkrrEXbc Download]

'''Lecture 17. Parametric distributions. '''

[https://yadi.sk/i/D33hdVXTrkxDN Download]

'''Lecture 18. Clustering. '''

[https://yadi.sk/i/3DJY4Oo7rkxEW Download]

'''Lecture 19. Mixture densities, EM-algorithm. ''' - updated.

[https://yadi.sk/i/22wLOL2krkxGL Download]

'''Lecture 20. Recommender systems. '''

[https://yadi.sk/i/mMhaZlqLrvjph Download]

'''Lecture 21. Kernel density estimation. '''

[https://yadi.sk/i/zF7ZPyTCsAjFH Download]
  
 
== Seminars ==

'''Seminar 1. Introduction to Data Analysis in Python '''

Practical task 1, data

Additional materials: 1, 2

'''Seminar 2. kNN '''

Theoretical task 2, Practical task 2, data

Additional materials: Visualization tutorial

'''Seminar 3. Decision trees '''

Theoretical task 3
 
'''Seminar 4. Linear classifiers '''

[https://drive.google.com/file/d/0B7TWwiIrcJstc1J4SnRWTnlxZlE/view?usp=sharing Theoretical task 4], [https://drive.google.com/file/d/0B7TWwiIrcJstckZCN1pYcEo2NW8/view?usp=sharing Practical task 4], [https://drive.google.com/file/d/0B7TWwiIrcJstM3VSQTVrYWhqYk0/view?usp=sharing first dataset], [https://drive.google.com/file/d/0B7TWwiIrcJstN0VOUV9PcDc1ZlE/view?usp=sharing diabetes dataset]

UPD: In all parts of practical task 4 you should use the GD and SGD functions that you implemented in the first part!
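
For concreteness, here is a minimal sketch of what generic GD and SGD routines might look like for minimizing a differentiable loss (the names and signatures are hypothetical, not the ones required by the task):

 # Gradient descent and stochastic gradient descent sketches (Python 2.7).
 import numpy as np
 
 def gd(grad, w0, lr=0.1, n_iter=100):
     # grad(w) returns the full gradient of the loss at w.
     w = w0.copy()
     for _ in xrange(n_iter):
         w -= lr * grad(w)
     return w
 
 def sgd(grad_i, w0, n_objects, lr=0.1, n_iter=1000):
     # grad_i(w, i) returns the gradient of the loss on object i only.
     w = w0.copy()
     for _ in xrange(n_iter):
         i = np.random.randint(n_objects)
         w -= lr * grad_i(w, i)
     return w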
  
 
''Deadline for this practical task has been changed for some groups! Check it in the table!''

'''Seminar 5. Model evaluation '''
 
[https://drive.google.com/file/d/0B7TWwiIrcJstdTFST0Z4UkRoaEk/view?usp=sharing Theoretical task 5]
  
'''Seminar 6. Bayesian decision rule '''

[https://drive.google.com/file/d/0B7TWwiIrcJstWF9zcllDU01ZY2M/view?usp=sharing Theoretical task 6], [https://drive.google.com/file/d/0B7TWwiIrcJstSHRKcFlVcy1xd3M/view?usp=sharing Practical task 6], [https://drive.google.com/file/d/0B7TWwiIrcJstVlhBdGZYMm94SHc/view?usp=sharing data]

Practical task 6 has been extended: the last part is described in more detail, and there are two small corrections in the first part (marked in bold). Read it carefully!

'''Deadline for this practical task has been changed for all groups! Check it in the table!'''

'''Seminar 7. SVM and kernel trick '''

[https://drive.google.com/file/d/0B7TWwiIrcJstUGhEX2QxV1gycDA/view?usp=sharing Theoretical task 7]

Additional materials: [http://www.ccas.ru/voron/download/SVM.pdf Lecture by K.V. Vorontsov on SVM]

'''Seminar 8. Regression '''

[https://drive.google.com/file/d/0B7TWwiIrcJstNkE0dVJscGFSOEE/view?usp=sharing Theoretical task 8]

'''Seminar 9. Boosting '''

[https://drive.google.com/file/d/0B7TWwiIrcJstVkx2Z0tGbi1yNFE/view?usp=sharing Practical task 9], [https://drive.google.com/file/d/0B7TWwiIrcJstelRydjBUcUlleUk/view?usp=sharing data]

'''Seminar 10. Ensemble methods '''

[https://drive.google.com/file/d/0B7TWwiIrcJstR28zVlN3OUZJTEU/view?usp=sharing Theoretical task 10]

''Problem 2: a small typo was corrected in the loss function formula.''

'''Seminar 11. Summary'''

'''Seminar 12. How to solve practical problems'''

[https://inclass.kaggle.com/c/cmc-msu-machine-learning-spring-2015-2016-dota-competition Dota Competition from the seminar], [https://drive.google.com/file/d/0B7TWwiIrcJstVnF0RVkzc1c1TVE/view?usp=sharing ipython notebook]

'''Seminar 13. Feature selection'''

[https://drive.google.com/file/d/0B7TWwiIrcJstTUU4OGFnQzBPa3c/view?usp=sharing Theoretical task 13], [https://drive.google.com/file/d/0B7TWwiIrcJstM3JkWnhtQ2YzZms/view?usp=sharing Practical task 13]

''The practical task is now complete.''

'''Seminar 14. Feature extraction'''

You can read about computing PCA through SVD at the end of [https://drive.google.com/file/d/0B7TWwiIrcJstWFFSOUI5aTRBM00/view?usp=sharing this paper].

'''Seminar 15. Neural networks'''

[https://drive.google.com/file/d/0B0s4HdnpuPdxbk4yeXZyVi10dVE/view?usp=sharing Practical task 15], [https://www.dropbox.com/s/r0u3xh5ybtstw9c/mnist.zip?dl=0 Data], [https://drive.google.com/file/d/0B7TWwiIrcJstX1FzMWhQdmdGekk/view?usp=sharing Data in csv format], [https://drive.google.com/file/d/0B0QWEJMlsxfRS2lWSkR5LVc2MzA/view?usp=sharing Censored training set], [https://drive.google.com/file/d/0B0QWEJMlsxfRZmlHVHdlYWdQSTQ/view?usp=sharing Theoretical task 15]

Additional materials: [https://drive.google.com/file/d/0B7TWwiIrcJstd3pOWjcwUUNOaUk/view?usp=sharing Backpropagation], [http://pybrain.org/docs/ PyBrain's documentation], [https://drive.google.com/file/d/0B7TWwiIrcJstSGp0SzNTa1RJeTQ/view?usp=sharing PyBrain example from the seminar]

''New data files have been uploaded (there were problems with reading the old ones), so the deadline has been changed for some groups! Check it in the table!''

If you get a MemoryError, read only part of the training data from the csv files (for example, 30000 objects). You can download the '''censored training set''' (see the link above) or use the following code:

 # Load MNIST from the csv files into PyBrain datasets (Python 2.7).
 import numpy as np
 from pybrain.datasets import ClassificationDataSet
 
 # Column 0 holds the digit label; columns 1: hold the 28x28 pixel values.
 mnist_train = np.loadtxt('mnist_train.csv', delimiter=',')
 # If you still run out of memory, keep only part of the rows,
 # e.g. mnist_train = mnist_train[:30000]
 train_data = ClassificationDataSet(28*28, nb_classes=10)
 for i in xrange(len(mnist_train)):
     train_data.appendLinked(mnist_train[i, 1:] / 255., int(mnist_train[i, 0]))
 train_data._convertToOneOfMany()
 
 mnist_test = np.loadtxt('mnist_test.csv', delimiter=',')
 test_data = ClassificationDataSet(28*28, nb_classes=10)
 for i in xrange(len(mnist_test)):
     test_data.appendLinked(mnist_test[i, 1:] / 255., int(mnist_test[i, 0]))
 test_data._convertToOneOfMany()
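
Once the datasets are built, they can be fed to a network. Below is only a sketch of how this might look with PyBrain (the hidden layer size, learning rate and number of epochs are arbitrary choices, not requirements of the task):

 # Train a one-hidden-layer softmax network and report the test error (Python 2.7).
 from pybrain.tools.shortcuts import buildNetwork
 from pybrain.structure.modules import SoftmaxLayer
 from pybrain.supervised.trainers import BackpropTrainer
 from pybrain.utilities import percentError
 
 net = buildNetwork(train_data.indim, 100, train_data.outdim, outclass=SoftmaxLayer)
 trainer = BackpropTrainer(net, dataset=train_data, learningrate=0.01)
 for epoch in xrange(5):
     trainer.train()  # one pass over the training data
     predictions = trainer.testOnClassData(dataset=test_data)
     print 'epoch %d: test error %.2f%%' % (epoch, percentError(predictions, test_data['class']))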

'''Seminar 16. Clustering'''

[https://drive.google.com/file/d/0B0QWEJMlsxfRRy1fd1RNVVhpMm8/view?usp=sharing Theoretical task 16], [https://drive.google.com/file/d/0B7TWwiIrcJstM1l5M3kwNDFZQlU/view?usp=sharing Practical task 16], [https://drive.google.com/file/d/0B7TWwiIrcJstZ2xIRU00dTB0OHc/view?usp=sharing parrots.jpg], [https://drive.google.com/file/d/0B7TWwiIrcJstb0RDc0RiQ1M3OXc/view?usp=sharing grass.jpg]

'''Seminar 17. Clustering, EM-algorithm'''

[https://drive.google.com/file/d/0B0s4HdnpuPdxLWJjUEtCVVNFa00/view?usp=sharing Theoretical task 17]

'''Seminar 18. Recommender systems'''

[https://drive.google.com/file/d/0B7TWwiIrcJstWkt3Qk5Cb0FWTUk/view?usp=sharing Theoretical task 18], [https://drive.google.com/file/d/0B7TWwiIrcJstbW5Sd3hBMjNoNkE/view?usp=sharing Practical task 18], [https://www.dropbox.com/s/5as2k79dhajtw7n/data.zip?dl=0 data]

Additional materials: [https://www.semanticscholar.org/paper/Factorization-Machines-Rendle/2ef7d506b25731d0f3ec0c8f90b718b6e5bbd069/pdf Factorization Machines]

Columns in the data: 0 - user, 1 - item, 2 - rating, 3 - time (you don't need this one).

In the practical task you should train models on the train data (base) and evaluate them on the test data.
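
For example, the ratings files could be loaded with pandas along the lines of the sketch below (the file name 'ratings.base' and the tab separator are assumptions - adjust them to the actual files in the archive):

 # Load the ratings table described above and drop the unused time column (Python 2.7).
 import pandas as pd
 
 cols = ['user', 'item', 'rating', 'time']
 train = pd.read_csv('ratings.base', sep='\t', names=cols)
 train = train.drop('time', axis=1)  # column 3 is not needed
 print train.head()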

== Evaluation criteria ==

The course runs during the 3rd and 4th modules. Students are assessed through their home assignments and exams. Home assignments are divided into theoretical tasks and practical tasks. There are two exams during the course - after the 3rd module and after the 4th module respectively. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.

The final grade takes values 4, 5, ..., 10; grades corresponding to 1, 2, 3 are considered unsatisfactory. The exact grade is calculated using the following rule:

* '''score''' ≥ 35% => 4,
* '''score''' ≥ 45% => 5,
* ...
* '''score''' ≥ 95% => 10,

where '''score''' is calculated using the following rule:

'''score''' = 0.6 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>exam2</sub> + 0.2 * S<sub>competition</sub>
  
* S<sub>homework</sub> – proportion of correctly solved homework,
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during the exam after module 3,
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during the exam after module 4,
* S<sub>competition</sub> – score for the machine learning competition (also from 0 to 1).

Participation in the machine learning competition is optional and can give students extra points.
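
To make the rule concrete, here is a small sketch that computes the final grade from the formula above (the thresholds between 45% and 95% are assumed to continue in steps of 10%, as the ellipsis in the list suggests):

 # Final grade from the component scores, each in [0, 1] (Python 2.7).
 def final_grade(s_homework, s_exam1, s_exam2, s_competition=0.0):
     score = 0.6 * s_homework + 0.2 * s_exam1 + 0.2 * s_exam2 + 0.2 * s_competition
     thresholds = [(0.95, 10), (0.85, 9), (0.75, 8), (0.65, 7),
                   (0.55, 6), (0.45, 5), (0.35, 4)]
     for t, grade in thresholds:
         if score >= t:
             return grade
     return 3  # unsatisfactory
 
 print final_grade(0.8, 0.7, 0.6, 0.5)  # score = 0.84 -> grade 8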

== Plagiarism ==

In case of discovered plagiarism, zero points will be given for the home assignment - for both works that were found to be identical. In case of repeated plagiarism by the same person, a report to the dean will be filed.
  
== Deadlines ==

All deadlines can be found in the second tab [https://drive.google.com/open?id=1TQ97B8rqC7sUxTnCMKXoskgRPXO8rAyWoBWBezY58h4 here].

There are two deadlines for each assignment: normal and late. An assignment submitted before the normal deadline is scored with no penalty. The maximum score is reduced by 50% for assignments submitted between the normal and the late deadline. Assignments submitted after the late deadline are not scored (they receive zero) in the absence of legitimate reasons for late submission, which do not include a high workload in other classes.

The standard period for working on a homework assignment (normal and late deadlines respectively) is 2 and 4 weeks for practical assignments and 1 and 2 weeks for theoretical ones. The first practical assignment is an exception.

'''Deadline time:''' 23:59 on the day before the seminar (Sunday for students attending Monday seminars and Wednesday for students that have seminars on Thursday).

== Structure of emails and homework submissions ==

All questions and submissions must be addressed to '''cshse.ml@gmail.com'''.
The following subject lines must be used:
* For ''questions'' (general, regarding assignments, etc.): "Question - Surname Name - Group(subgroup)"
* For ''homework submissions'': "Practice/Theory {Lab number} - Surname Name - Group(subgroup)"

''Example'': Practice 1 - Ivanov Ivan - 131(1)

If you want to address a particular teacher, mention their name in the subject.

''Example'': Question - Ivanov Ivan - 131(1) - Ekaterina

Please do not mix two different topics in a single email, such as a theoretical and a practical assignment. When replying, please use the '''same''' thread (i.e. reply to the same email).

Practical assignments must be submitted in ipython notebook format, theoretical ones in pdf. Practical assignments must use '''Python 2.7'''. Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your files.

Assignments can be written in either Russian or English.

'''Assignments can be submitted only once!'''
  
== Useful links ==

=== Machine learning ===

* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video lectures of the machine learning course by K.V. Vorontsov]
* One of the classic and most comprehensive books on machine learning: [http://web.stanford.edu/~hastie/local.ftp/Springer/ESLII_print10.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]

=== Python ===
* [http://python.org Official website]
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].
* A small example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a short guide with examples for Python 2]
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]
* Lectures: [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]
  
=== Python installation and configuration ===

* [[Анализ данных (Программная инженерия)/Установка и настройка Python#Windows|Windows]]
* [[Анализ данных (Программная инженерия)/Установка и настройка Python#Mac_OS|Mac OS]]
* [[Анализ данных (Программная инженерия)/Установка и настройка Python#Linux|Linux]]
