http://wiki.cs.hse.ru/api.php?action=feedcontributions&user=Mhushchyn&feedformat=atomWiki - Факультет компьютерных наук - Вклад участника [ru]2020-09-24T22:14:37ZВклад участникаMediaWiki 1.23.2http://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2020Data analysis (Software Engineering) 20202020-06-02T07:14:17Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTgxOTM5MDE5NTU5LTY5MGI2YWEwYWJkNmM5YmFhNDFkYjIwZjU0MTQyNDNmZDZkOTZmNTE4OGNhOGJlMzMwMDU0ZTc0YjRiYzQyMmY Slack Invite Link]''' <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]'''<br/><br />
'''[https://anytask.org/course/608 Anytask]'''<br/><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, and collaborative filtering. We will also study the mathematical methods and concepts on which data analysis is based, the formal assumptions behind them, and various aspects of their implementation.<br />
<br />
Significant attention is given to the practical skills of data analysis, which will be developed in seminars through the Python programming language and relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Final Exam ==<br />
The final exam will be held on the '''8th of June'''.<br />
<br />
Details about the timing will be available soon.<br />
<br />
The question list and rules are available [https://cloud.mail.ru/public/HSZw/WhDfGE1CM here].<br />
<br />
== Kaggle ==<br />
The link to the competition is in Slack.<br />
<br />
You should send your report before June 5, 23:59 (the competition ends on the 4th of June; late submissions are not considered). <br/><br />
Reports should be submitted via the special [https://forms.gle/5QTo7ycrn1zGhg9m8 form].<br />
<br />
Try to follow the format of the report template: https://github.com/shestakoff/hse_se_ml/blob/master/2019/kaggle/kaggle-report-template.ipynb<br />
<br />
== Colloquium ==<br />
The colloquium will be held on the '''7th of April''' during the seminar.<br />
<br />
You may not use any materials during the colloquium, except a single A4 sheet prepared before the exam and handwritten personally by you (on both sides). You will receive 2 questions from the [https://github.com/shestakoff/hse_se_ml/raw/master/2020/colloq/colloq-2020.pdf '''question list'''], with 15 minutes for preparation, and may be asked additional questions or given extra tasks.<br />
<br />
== Course Schedule (3rd module)==<br />
===Lectures===<br />
'''Mondays'''<br />
* 10:30-11:50, Room R205<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Metric-based methods. K-NN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l02-knn/lecture-knn.slides#/ Slides] <br/><br />
<br />
'''Lecture 3. Decision Trees''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l03-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
'''Lecture 4. Linear Regression''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l04-linreg/lecture-linreg.slides#/ Slides] <br/><br />
<br />
'''Lecture 5. Linear Classification''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l05-linclass/lecture-linclass.slides#/ Slides] <br/><br />
<br />
'''Lecture 6. Quality measures''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l06-metrics/lecture-metrics.slides#/ Slides], [https://youtu.be/kItcW-G0wzM record] <br/><br />
<br />
'''Lecture 7. Dimension reduction. PCA''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l07-dimred/lecture-dimred.slides#/ Slides] <br/><br />
<br />
'''Lecture 8. NLP Introduction''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l08-nlp-intro/lecture-nlp-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 9. Word embeddings''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l09-nlp-w2v/lecture-nlp-w2v.slides#/ Slides] <br/><br />
<br />
'''Lecture 10. Ensembles. Random Forest''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l10-ensembles/lecture-ensemble.slides#/ Slides] <br/><br />
<br />
'''Lecture 11. Ensembles. Boosting''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l11-boosting/lecture-boosting.slides#/ Slides] <br/><br />
<br />
'''Lecture 12. Neural Networks 1''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l12-nn1/lecture-nn1.slides#/ Slides] <br/><br />
<br />
'''Lecture 13. Neural Networks 2''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l13-nn2/lecture-nn2.slides#/ Slides] <br/><br />
<br />
'''Lecture 14. Clustering''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l14-cluster/lecture-clust.slides#/ Slides] <br/><br />
<br />
'''Lecture 15. Recsys''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l15-recsys/lecture-recsys.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] '''Due Date: 28.01.2020 23:59'''<br/><br />
<br />
'''Seminar 2. Metric-based methods. K-NN'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s02-metric-based-methods%20 Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/seminar2-homework.ipynb Homework 2] '''Due Date: 04.02.2020 23:59'''<br/><br />
<br />
'''Seminar 3. Decision Trees'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s03-decision-trees Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees//seminar3-homework.ipynb Homework 3] '''Due Date: 01.03.2020 23:59'''<br/><br />
<br />
'''Seminar 4. Linear Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s04-linear-regression Practice in class] <br/><br />
<br />
'''Seminar 5. Logistic Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-logreg.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-homework.ipynb Homework 4] '''Due Date: 22.03.2020 23:59'''<br/><br />
<br />
'''Seminar 6. Quality Measures'''<br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/seminar6-quality.ipynb Practice in class] <br/><br />
<br />
'''Seminar 7. Dimension Reduction'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-dimred.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-homework.ipynb Homework 5] '''Due Date: 12.04.2020 23:59'''<br/><br />
<br />
'''Seminar 8. Introduction to NLP'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s08-nlp Practice in class] <br/><br />
[https://www.kaggle.com/c/explicit-content-detection Kaggle 1] '''Due Date: 28.04.2020 23:59'''<br/><br />
<br />
'''Seminar 9. Word2Vec'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s09-word2vec Practice in class] <br/><br />
<br />
'''Seminar 10. Ensembles. Random Forest'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s10-ensembles/seminar10-ensembles.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s10-ensembles/seminar10-homework.ipynb Homework 6] '''Due Date: 12.05.2020 23:59'''<br/><br />
<br />
'''Seminar 11. Boosting'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s11-boosting/seminar.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s11-boosting/homework.ipynb Homework 7] '''Due Date: 19.05.2020 23:59'''<br/><br />
<br />
'''Seminar 12. NN-1'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s12-nn1/seminar12-nn1.ipynb Practice in class] <br/><br />
<br />
'''Seminar 13. NN-2'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s13-nn2/seminar13-nn2.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s13-nn2/seminar13-homework.ipynb Homework 8] '''Due Date: 27.05.2020 23:59'''<br/><br />
<br />
'''Seminar 14. Clustering'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s14-clustering Practice in class] <br/><br />
<br />
'''Seminar 15. RecSys'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s15-recsys Practice in class] <br/><br />
<br />
== Theoretical questions for the colloquium ==<br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/knn_theory.pdf Metric-based methods. K-NN] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees/trees_theory.pdf Decision Trees] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s04-linear-regression/linreg_theory.pdf Linear Regression] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/linclass_theory.pdf Logistic Regression] <br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/metrics_svm.pdf Quality Measures] <br/><br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s11-boosting/boosting-theory.pdf Boosting] <br/><br />
<br />
== Evaluation criteria ==<br />
The course lasts during the 3rd and 4th modules. Knowledge of students is assessed by evaluation of their home assignments and exams. There are two exams during the course – after the 3rd module and after the 4th module respectively. Each of the exams evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10; grades 1, 2 and 3 are considered unsatisfactory. Exact grades are calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
S<sub>cumulative</sub> = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the machine learning competition (also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can give students extra points. <br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
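The scoring rule above can be sketched in Python (a minimal illustration using the weights and thresholds from this page; the function and variable names are ours, and component scores are assumed to lie in [0, 1]):<br />

```python
def final_score(s_homework, s_exam1, s_exam2, s_competition=0.0):
    """Combine the component scores (each in [0, 1]) into the course score."""
    s_cumulative = 0.8 * s_homework + 0.2 * s_exam1 + 0.2 * s_competition
    return 0.7 * s_cumulative + 0.3 * s_exam2

def grade(score):
    """Map a score to the 10-point scale: thresholds 35%, 45%, ..., 95%."""
    for g in range(10, 3, -1):           # try 10 first, then 9, ..., down to 4
        if score >= (10 * g - 5) / 100:  # grade 4 -> 35%, 5 -> 45%, ..., 10 -> 95%
            return g
    return 3  # below 35%: unsatisfactory (simplified; actual grades 1-3 vary)

# Strong homework, decent exams, no competition:
print(grade(final_score(0.9, 0.7, 0.8)))  # → 8
```

Note that with the competition term the cumulative score can exceed 1 (by at most 0.2), which is how the optional competition yields extra points.<br />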
<br />
'''Kaggle competition 1'''<br/><br />
'''Score''' = ("your quality" - "baseline method quality") / ("max achieved quality" - "baseline method quality") <br />
'''Required condition:''' a notebook with your best solution must be reproducible. Otherwise, you will not get any score. <br />
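The competition score formula can be written directly in Python (a sketch; the quality values would come from the leaderboard, and the function name is ours):<br />

```python
def kaggle_score(your_quality, baseline_quality, max_quality):
    """Normalize your leaderboard quality to [0, 1], relative to the
    baseline method and the best quality achieved in the competition."""
    return (your_quality - baseline_quality) / (max_quality - baseline_quality)

# A solution halfway between the baseline and the best result scores ~0.5:
half = kaggle_score(0.75, 0.60, 0.90)
```

Matching the baseline yields 0 and matching the best achieved quality yields 1.<br />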
<br />
== Plagiarism ==<br />
If plagiarism is discovered, zero points will be given for the home assignment, for both works found to be identical. In case of repeated plagiarism by the same person, a report to the dean will be made.<br />
<br />
== Deadlines ==<br />
<br />
Assignments sent after the late deadline will not be scored (they are assigned a zero score) in the absence of a legitimate reason for late submission; a high workload in other classes does not count as such a reason. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented in Jupyter Notebook format, theoretical ones in PDF. Practical assignments must use '''Python 3''' (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
Link for submissions: '''[https://anytask.org/course/608 Anytask]'''<br/><br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a short guide with Python 2 examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2020Data analysis (Software Engineering) 20202020-05-25T18:51:19Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTgxOTM5MDE5NTU5LTY5MGI2YWEwYWJkNmM5YmFhNDFkYjIwZjU0MTQyNDNmZDZkOTZmNTE4OGNhOGJlMzMwMDU0ZTc0YjRiYzQyMmY Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
'''[https://anytask.org/course/608 Anytask]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We will also study mathematical methods and concepts which data analysis is based on as well as formal assumptions behind them and various aspects of their implementation.<br />
<br />
A significant attention is given to practical skills of data analysis that will be developed on seminars by studying the Python programming language and relevant libraries for scientific computing.<br />
<br />
The knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Kaggle ==<br />
Link to competition is in slack<br />
<br />
You should send reports before June 5 23:59 (Competition ends on the 4th of June, late submissions are not considered). Reports should be sent to the special form, that is going to be provided soon. <br />
Try to follow the format of report template - https://github.com/shestakoff/hse_se_ml/blob/master/2019/kaggle/kaggle-report-template.ipynb<br />
<br />
<br />
== Colloquium ==<br />
Colloquium will be held on the '''7th of April''' during seminar<br />
<br />
You may not use any materials during colloquium, except single A4 prepared before the exam and handwritten personally by you (from two sides). You will have 2 questions from [https://github.com/shestakoff/hse_se_ml/raw/master/2020/colloq/colloq-2020.pdf '''question list'''] with 15 minutes for preparation and may receive additional questions or tasks.<br />
<br />
== Course Schedule (3rd module)==<br />
===Lectures===<br />
'''Mondays'''<br />
* 10:30-11:50, Room R205<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Metric-based methods. K-NN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l02-knn/lecture-knn.slides#/ Slides] <br/><br />
<br />
'''Lecture 3. Decision Trees''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l03-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
'''Lecture 4. Linear Regression''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l04-linreg/lecture-linreg.slides#/ Slides] <br/><br />
<br />
'''Lecture 5. Linear Classification''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l05-linclass/lecture-linclass.slides#/ Slides] <br/><br />
<br />
'''Lecture 6. Quality measures''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l06-metrics/lecture-metrics.slides#/ Slides], [https://youtu.be/kItcW-G0wzM record] <br/><br />
<br />
'''Lecture 7. Dimension reductio. PCA''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l07-dimred/lecture-dimred.slides#/ Slides] <br/><br />
<br />
'''Lecture 8. NLP Introduction''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l08-nlp-intro/lecture-nlp-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 9. Word embeddings''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l09-nlp-w2v/lecture-nlp-w2v.slides#/ Slides] <br/><br />
<br />
'''Lecture 10. Ensembles. Random Forest''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l10-ensembles/lecture-ensemble.slides#/ Slides] <br/><br />
<br />
'''Lecture 11. Ensembles. Boosting''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l11-boosting/lecture-boosting.slides#/ Slides] <br/><br />
<br />
'''Lecture 12. Neural Networks 1''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l12-nn1/lecture-nn1.slides#/ Slides] <br/><br />
<br />
'''Lecture 13. Neural Networks 2''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l13-nn2/lecture-nn2.slides#/ Slides] <br/><br />
<br />
'''Lecture 14. Clustering''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l14-cluster/lecture-clust.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] '''Due Date: 28.01.2020 23:59'''<br/><br />
<br />
'''Seminar 2. Metric-based methods. K-NN'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s02-metric-based-methods%20 Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/seminar2-homework.ipynb Homework 2] '''Due Date: 04.02.2020 23:59'''<br/><br />
<br />
'''Seminar 3. Decision Trees'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s03-decision-trees Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees//seminar3-homework.ipynb Homework 3] '''Due Date: 01.03.2020 23:59'''<br/><br />
<br />
'''Seminar 4. Linear Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s04-linear-regression Practice in class] <br/><br />
<br />
'''Seminar 5. Logistic Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-logreg.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-homework.ipynb Homework 4] '''Due Date: 22.03.2020 23:59'''<br/><br />
<br />
'''Seminar 6. Quality Measures'''<br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/seminar6-quality.ipynb Practice in class] <br/><br />
<br />
'''Seminar 7. Dimention Reduction'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-dimred.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-homework.ipynb Homework 5] '''Due Date: 12.04.2020 23:59'''<br/><br />
<br />
'''Seminar 8. Introduction to NLP'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s08-nlp Practice in class] <br/><br />
[https://www.kaggle.com/c/explicit-content-detection Kaggle 1] '''Due Date: 28.04.2020 23:59'''<br/><br />
<br />
'''Seminar 9. Word2Vec'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s09-word2vec Practice in class] <br/><br />
<br />
'''Seminar 10. Ensembles. Random Forest'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s10-ensembles/seminar10-ensembles.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s10-ensembles/seminar10-homework.ipynb Homework 6] '''Due Date: 12.05.2020 23:59'''<br/><br />
<br />
'''Seminar 11. Boosting'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s11-boosting/seminar.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s11-boosting/homework.ipynb Homework 7] '''Due Date: 19.05.2020 23:59'''<br/><br />
<br />
'''Seminar 12. NN-1'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s12-nn1/seminar12-nn1.ipynb Practice in class] <br/><br />
<br />
'''Seminar 13. NN-2'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s13-nn2/seminar13-nn2.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s13-nn2/seminar13-homework.ipynb Homework 8] '''Due Date: 26.05.2020 23:59'''<br/><br />
<br />
'''Seminar 14. Clustering'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s14-clustering Practice in class] <br/><br />
<br />
== Theoretical questions for the colloquium ==<br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/knn_theory.pdf Metric-based methods. K-NN] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees/trees_theory.pdf Decision Trees] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s04-linear-regression/linreg_theory.pdf Linear Regression] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/linclass_theory.pdf Logistic Regression] <br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/metrics_svm.pdf Quality Measures] <br/><br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s11-boosting/boosting-theory.pdf Boosting] <br/><br />
<br />
== Evaluation criteria ==<br />
The course lasts during the 3rd and 4th modules. Knowledge of students is assessed by evaluation of their home assignments and exams. There are two exams during the course – after the 3rd module and after the 4th module respectively. Each of the exams evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
Grade takes values 4,5,…10. Grades, corresponding to 1,2,3 are assumed unsatisfactory. Exact grades are calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br \><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in machine learning competition is optional and can give students extra points. <br \><br />
"Automative" passing of the course based on '''cumulative score''' ''may'' be issued.<br />
<br />
'''Kaggle competition 1'''<br/><br />
«Score» = ("your quality"-"baseline method quality") / ("max achieved quality" - "baseline method quality") <br \><br />
'''Required condition:''' a notebook with your best solution must be reproducible. Otherwise, you will not get any score. <br />
<br />
== Plagiarism ==<br />
In case of discovered plagiarism zero points will be set for the home assignemets - for both works, which were found to be identical. In case of repeated plagiarism by one and the same person a report to the dean will be made.<br />
<br />
== Deadlines ==<br />
<br />
Assignments sent after late deadlines will not be scored (assigned with zero score) in the absence of legitimate reasons for late submission which do not include high load on other classes. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented in Jupyter Notebook format, theoretical ones in pdf. Practical assignments must use '''Python 3''' (or Python 3 compatible). Use your surname as a filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
Link for the submissions: '''[https://anytask.org/course/608 Anytask.]<br/><br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A little example for the begginers: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d краткое руководство с примерами по Python 2]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks Коллекция интересных IPython ноутбуков]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2020Data analysis (Software Engineering) 20202020-05-01T08:51:56Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTgxOTM5MDE5NTU5LTY5MGI2YWEwYWJkNmM5YmFhNDFkYjIwZjU0MTQyNDNmZDZkOTZmNTE4OGNhOGJlMzMwMDU0ZTc0YjRiYzQyMmY Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
'''[https://anytask.org/course/608 Anytask]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We will also study mathematical methods and concepts which data analysis is based on as well as formal assumptions behind them and various aspects of their implementation.<br />
<br />
A significant attention is given to practical skills of data analysis that will be developed on seminars by studying the Python programming language and relevant libraries for scientific computing.<br />
<br />
The knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Colloquium ==<br />
Colloquium will be held on the '''7th of April''' during seminar<br />
<br />
You may not use any materials during colloquium, except single A4 prepared before the exam and handwritten personally by you (from two sides). You will have 2 questions from [https://github.com/shestakoff/hse_se_ml/raw/master/2020/colloq/colloq-2020.pdf '''question list'''] with 15 minutes for preparation and may receive additional questions or tasks.<br />
<br />
== Course Schedule (3rd module)==<br />
===Lectures===<br />
'''Mondays'''<br />
* 10:30-11:50, Room R205<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Metric-based methods. K-NN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l02-knn/lecture-knn.slides#/ Slides] <br/><br />
<br />
'''Lecture 3. Decision Trees''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l03-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
'''Lecture 4. Linear Regression''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l04-linreg/lecture-linreg.slides#/ Slides] <br/><br />
<br />
'''Lecture 5. Linear Classification''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l05-linclass/lecture-linclass.slides#/ Slides] <br/><br />
<br />
'''Lecture 6. Quality measures''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l06-metrics/lecture-metrics.slides#/ Slides], [https://youtu.be/kItcW-G0wzM record] <br/><br />
<br />
'''Lecture 7. Dimension Reduction. PCA''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l07-dimred/lecture-dimred.slides#/ Slides] <br/><br />
<br />
'''Lecture 8. NLP Introduction''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l08-nlp-intro/lecture-nlp-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 9. Word embeddings''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l09-nlp-w2v/lecture-nlp-w2v.slides#/ Slides] <br/><br />
<br />
'''Lecture 10. Ensembles. Random Forest''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l10-ensembles/lecture-ensemble.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] '''Due Date: 28.01.2020 23:59'''<br/><br />
<br />
'''Seminar 2. Metric-based methods. K-NN'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s02-metric-based-methods%20 Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/seminar2-homework.ipynb Homework 2] '''Due Date: 04.02.2020 23:59'''<br/><br />
<br />
'''Seminar 3. Decision Trees'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s03-decision-trees Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees//seminar3-homework.ipynb Homework 3] '''Due Date: 01.03.2020 23:59'''<br/><br />
<br />
'''Seminar 4. Linear Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s04-linear-regression Practice in class] <br/><br />
<br />
'''Seminar 5. Logistic Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-logreg.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-homework.ipynb Homework 4] '''Due Date: 22.03.2020 23:59'''<br/><br />
<br />
'''Seminar 6. Quality Measures'''<br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/seminar6-quality.ipynb Practice in class] <br/><br />
<br />
'''Seminar 7. Dimension Reduction'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-dimred.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-homework.ipynb Homework 5] '''Due Date: 12.04.2020 23:59'''<br/><br />
<br />
'''Seminar 8. Introduction to NLP'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s08-nlp Practice in class] <br/><br />
[https://www.kaggle.com/c/explicit-content-detection Kaggle 1] '''Due Date: 28.04.2020 23:59'''<br/><br />
<br />
'''Seminar 9. Word2Vec'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s09-word2vec Practice in class] <br/><br />
<br />
'''Seminar 10. Ensembles. Random Forest'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s10-ensembles/seminar10-ensembles.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s10-ensembles/seminar10-homework.ipynb Homework 6] '''Due Date: 12.05.2020 23:59'''<br/><br />
<br />
== Theoretical questions for the colloquium ==<br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/knn_theory.pdf Metric-based methods. K-NN] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees/trees_theory.pdf Decision Trees] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s04-linear-regression/linreg_theory.pdf Linear Regression] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/linclass_theory.pdf Logistic Regression] <br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/metrics_svm.pdf Quality Measures] <br/><br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students' knowledge is assessed through home assignments and exams. There are two exams during the course: one after the 3rd module and one after the 4th module. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10. Grades 1, 2, and 3 are considered unsatisfactory. Exact grades are calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the machine learning competition (also ranging from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can give students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be granted.<br />
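As a worked illustration, the grading rule above can be sketched as a small Python function. This is only a sketch: the function and variable names are ours, not part of the course materials, and the exact grade returned for scores below 35% is not specified on this page (we return 3 as a representative unsatisfactory grade).

```python
# Illustrative sketch of the grading rule described above.
# Names are ours; the sub-35% mapping is an assumption (we return 3).

def final_grade(s_homework, s_exam1, s_exam2, s_competition=0.0):
    """All inputs are proportions in [0, 1]; the competition term is optional."""
    cumulative = 0.8 * s_homework + 0.2 * s_exam1 + 0.2 * s_competition
    score = 0.7 * cumulative + 0.3 * s_exam2
    percent = 100 * score
    if percent < 35:
        return 3  # unsatisfactory (exact grade below 35% is not specified)
    # Thresholds: 35% -> 4, 45% -> 5, ..., 95% -> 10 (capped at 10).
    return min(10, 4 + int((percent - 35) // 10))

print(final_grade(s_homework=0.9, s_exam1=0.8, s_exam2=0.7))  # -> 8
```

Note that the competition term can push the cumulative score above 1, which matches the statement that the competition gives extra points.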
<br />
'''Kaggle competition 1'''<br/><br />
'''Score''' = ("your quality" - "baseline method quality") / ("max achieved quality" - "baseline method quality") <br /><br />
'''Required condition:''' the notebook with your best solution must be reproducible. Otherwise, you will not receive any score. <br />
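The normalization above can be expressed as a one-line Python helper (a sketch with our own names, assuming all three quality values are measured on the same metric scale):

```python
# Sketch of the competition score normalization (names are ours).

def competition_score(your_quality, baseline_quality, max_quality):
    """Rescales quality linearly: the baseline maps to 0,
    the best achieved result maps to 1."""
    return (your_quality - baseline_quality) / (max_quality - baseline_quality)

print(round(competition_score(0.85, 0.70, 0.90), 2))  # -> 0.75
```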
<br />
== Plagiarism ==<br />
In case of discovered plagiarism, zero points will be given for the home assignment, for both works found to be identical. In case of repeated plagiarism by the same person, a report will be made to the dean's office.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will not be scored (they receive a zero score) in the absence of a legitimate reason for late submission; a high workload in other classes does not count as one. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented in Jupyter Notebook format, theoretical ones in PDF. Practical assignments must use '''Python 3''' (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
Link for submissions: '''[https://anytask.org/course/608 Anytask]'''<br/><br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d A brief guide with Python 2 examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures: [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A gallery of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads Anaconda]</div>
<hr />
<div>'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTgxOTM5MDE5NTU5LTY5MGI2YWEwYWJkNmM5YmFhNDFkYjIwZjU0MTQyNDNmZDZkOTZmNTE4OGNhOGJlMzMwMDU0ZTc0YjRiYzQyMmY Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We will also study mathematical methods and concepts which data analysis is based on as well as formal assumptions behind them and various aspects of their implementation.<br />
<br />
A significant attention is given to practical skills of data analysis that will be developed on seminars by studying the Python programming language and relevant libraries for scientific computing.<br />
<br />
The knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Colloquium ==<br />
Colloquium will be held on the '''7th of April''' during seminar<br />
<br />
You may not use any materials during colloquium, except single A4 prepared before the exam and handwritten personally by you (from two sides). You will have 2 questions from [https://github.com/shestakoff/hse_se_ml/raw/master/2020/colloq/colloq-2020.pdf '''question list'''] with 15 minutes for preparation and may receive additional questions or tasks.<br />
<br />
== Course Schedule (3rd module)==<br />
===Lectures===<br />
'''Mondays'''<br />
* 10:30-11:50, Room R205<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Metric-based methods. K-NN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l02-knn/lecture-knn.slides#/ Slides] <br/><br />
<br />
'''Lecture 3. Decision Trees''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l03-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
'''Lecture 4. Linear Regression''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l04-linreg/lecture-linreg.slides#/ Slides] <br/><br />
<br />
'''Lecture 5. Linear Classification''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l05-linclass/lecture-linclass.slides#/ Slides] <br/><br />
<br />
'''Lecture 6. Quality measures''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l06-metrics/lecture-metrics.slides#/ Slides], [https://youtu.be/kItcW-G0wzM record] <br/><br />
<br />
'''Lecture 7. Dimension reductio. PCA''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l07-dimred/lecture-dimred.slides#/ Slides] <br/><br />
<br />
'''Lecture 8. NLP Introduction''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l08-nlp-intro/lecture-nlp-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 9. Word embeddings''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l09-nlp-w2v/lecture-nlp-w2v.slides#/ Slides] <br/><br />
<br />
'''Lecture 10. Ensembles. Random Forest''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l10-ensembles/lecture-ensemble.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] '''Due Date: 28.01.2020 23:59'''<br/><br />
<br />
'''Seminar 2. Metric-based methods. K-NN'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s02-metric-based-methods%20 Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/seminar2-homework.ipynb Homework 2] '''Due Date: 04.02.2020 23:59'''<br/><br />
<br />
'''Seminar 3. Decision Trees'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s03-decision-trees Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees//seminar3-homework.ipynb Homework 3] '''Due Date: 01.03.2020 23:59'''<br/><br />
<br />
'''Seminar 4. Linear Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s04-linear-regression Practice in class] <br/><br />
<br />
'''Seminar 5. Logistic Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-logreg.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-homework.ipynb Homework 4] '''Due Date: 22.03.2020 23:59'''<br/><br />
<br />
'''Seminar 6. Quality Measures'''<br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/seminar6-quality.ipynb Practice in class] <br/><br />
<br />
'''Seminar 7. Dimention Reduction'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-dimred.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-homework.ipynb Homework 5] '''Due Date: 12.04.2020 23:59'''<br/><br />
<br />
'''Seminar 8. Introduction to NLP'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s08-nlp Practice in class] <br/><br />
[https://www.kaggle.com/c/explicit-content-detection Kaggle 1] '''Due Date: 28.04.2020 23:59'''<br/><br />
<br />
'''Seminar 9. Word2Vec'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s09-word2vec Practice in class] <br/><br />
<br />
'''Seminar 10. Ensembles. Random Forest'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s10-ensembles/seminar10-ensembles.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s10-ensembles/seminar10-homework.ipynb Homework 6] '''Due Date: 12.05.2020 23:59'''<br/><br />
<br />
== Theoretical questions for the colloquium ==<br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/knn_theory.pdf Metric-based methods. K-NN] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees/trees_theory.pdf Decision Trees] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s04-linear-regression/linreg_theory.pdf Linear Regression] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/linclass_theory.pdf Logistic Regression] <br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/metrics_svm.pdf Quality Measures] <br/><br />
<br />
== Evaluation criteria ==<br />
The course lasts during the 3rd and 4th modules. Knowledge of students is assessed by evaluation of their home assignments and exams. There are two exams during the course – after the 3rd module and after the 4th module respectively. Each of the exams evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
Grade takes values 4,5,…10. Grades, corresponding to 1,2,3 are assumed unsatisfactory. Exact grades are calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br \><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in machine learning competition is optional and can give students extra points. <br \><br />
"Automative" passing of the course based on '''cumulative score''' ''may'' be issued.<br />
<br />
'''Kaggle competition 1'''<br/><br />
«Score» = ("your quality"-"baseline method quality") / ("max achieved quality" - "baseline method quality") <br \><br />
'''Required condition:''' a notebook with your best solution must be reproducible. Otherwise, you will not get any score. <br />
<br />
== Plagiarism ==<br />
In case of discovered plagiarism zero points will be set for the home assignemets - for both works, which were found to be identical. In case of repeated plagiarism by one and the same person a report to the dean will be made.<br />
<br />
== Deadlines ==<br />
<br />
Assignments sent after late deadlines will not be scored (assigned with zero score) in the absence of legitimate reasons for late submission which do not include high load on other classes. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented in Jupyter Notebook format, theoretical ones in pdf. Practical assignments must use '''Python 3''' (or Python 3 compatible). Use your surname as a filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A little example for the begginers: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d краткое руководство с примерами по Python 2]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks Коллекция интересных IPython ноутбуков]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2020Data analysis (Software Engineering) 20202020-04-13T14:46:16Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTgxOTM5MDE5NTU5LTY5MGI2YWEwYWJkNmM5YmFhNDFkYjIwZjU0MTQyNDNmZDZkOTZmNTE4OGNhOGJlMzMwMDU0ZTc0YjRiYzQyMmY Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We will also study mathematical methods and concepts which data analysis is based on as well as formal assumptions behind them and various aspects of their implementation.<br />
<br />
A significant attention is given to practical skills of data analysis that will be developed on seminars by studying the Python programming language and relevant libraries for scientific computing.<br />
<br />
The knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Colloquium ==<br />
Colloquium will be held on the '''7th of April''' during seminar<br />
<br />
You may not use any materials during colloquium, except single A4 prepared before the exam and handwritten personally by you (from two sides). You will have 2 questions from [https://github.com/shestakoff/hse_se_ml/raw/master/2020/colloq/colloq-2020.pdf '''question list'''] with 15 minutes for preparation and may receive additional questions or tasks.<br />
<br />
== Course Schedule (3rd module)==<br />
===Lectures===<br />
'''Mondays'''<br />
* 10:30-11:50, Room R205<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Metric-based methods. K-NN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l02-knn/lecture-knn.slides#/ Slides] <br/><br />
<br />
'''Lecture 3. Decision Trees''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l03-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
'''Lecture 4. Linear Regression''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l04-linreg/lecture-linreg.slides#/ Slides] <br/><br />
<br />
'''Lecture 5. Linear Classification''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l05-linclass/lecture-linclass.slides#/ Slides] <br/><br />
<br />
'''Lecture 6. Quality measures''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l06-metrics/lecture-metrics.slides#/ Slides], [https://youtu.be/kItcW-G0wzM record] <br/><br />
<br />
'''Lecture 7. Dimension reductio. PCA''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l07-dimred/lecture-dimred.slides#/ Slides] <br/><br />
<br />
'''Lecture 8. NLP Introduction''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l08-nlp-intro/lecture-nlp-intro.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] '''Due Date: 28.01.2020 23:59'''<br/><br />
<br />
'''Seminar 2. Metric-based methods. K-NN'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s02-metric-based-methods%20 Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/seminar2-homework.ipynb Homework 2] '''Due Date: 04.02.2020 23:59'''<br/><br />
<br />
'''Seminar 3. Decision Trees'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s03-decision-trees Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees//seminar3-homework.ipynb Homework 3] '''Due Date: 01.03.2020 23:59'''<br/><br />
<br />
'''Seminar 4. Linear Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s04-linear-regression Practice in class] <br/><br />
<br />
'''Seminar 5. Logistic Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-logreg.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-homework.ipynb Homework 4] '''Due Date: 22.03.2020 23:59'''<br/><br />
<br />
'''Seminar 6. Quality Measures'''<br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/seminar6-quality.ipynb Practice in class] <br/><br />
<br />
'''Seminar 7. Dimention Reduction'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-dimred.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-homework.ipynb Homework 5] '''Due Date: 12.04.2020 23:59'''<br/><br />
<br />
'''Seminar 8. Introduction to NLP'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s08-nlp Practice in class] <br/><br />
[https://www.kaggle.com/c/explicit-content-detection Kaggle 1] '''Due Date: 12.04.2020 23:59'''<br/><br />
<br />
== Theoretical questions for the colloquium ==<br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/knn_theory.pdf Metric-based methods. K-NN] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees/trees_theory.pdf Decision Trees] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s04-linear-regression/linreg_theory.pdf Linear Regression] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/linclass_theory.pdf Logistic Regression] <br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/metrics_svm.pdf Quality Measures] <br/><br />
<br />
== Evaluation criteria ==<br />
The course lasts during the 3rd and 4th modules. Knowledge of students is assessed by evaluation of their home assignments and exams. There are two exams during the course – after the 3rd module and after the 4th module respectively. Each of the exams evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
Grade takes values 4,5,…10. Grades, corresponding to 1,2,3 are assumed unsatisfactory. Exact grades are calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br \><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in machine learning competition is optional and can give students extra points. <br \><br />
"Automative" passing of the course based on '''cumulative score''' ''may'' be issued.<br />
<br />
'''Kaggle competition 1'''<br/><br />
«Score» = ("your quality"-"baseline method quality") / ("max achieved quality" - "baseline method quality") <br \><br />
'''Required condition:''' a notebook with your best solution must be reproducible. Otherwise, you will not get any score. <br />
<br />
== Plagiarism ==<br />
In case of discovered plagiarism zero points will be set for the home assignemets - for both works, which were found to be identical. In case of repeated plagiarism by one and the same person a report to the dean will be made.<br />
<br />
== Deadlines ==<br />
<br />
Assignments sent after late deadlines will not be scored (assigned with zero score) in the absence of legitimate reasons for late submission which do not include high load on other classes. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented in Jupyter Notebook format, theoretical ones in pdf. Practical assignments must use '''Python 3''' (or Python 3 compatible). Use your surname as a filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a brief guide to Python 2 with examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2020Data analysis (Software Engineering) 20202020-04-13T14:35:17Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTgxOTM5MDE5NTU5LTY5MGI2YWEwYWJkNmM5YmFhNDFkYjIwZjU0MTQyNDNmZDZkOTZmNTE4OGNhOGJlMzMwMDU0ZTc0YjRiYzQyMmY Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking and collaborative filtering. We will also study the mathematical methods and concepts on which data analysis is based, the formal assumptions behind them, and various aspects of their implementation.<br />
<br />
Significant attention is given to practical data analysis skills, which will be developed in seminars through the Python programming language and the relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Colloquium ==<br />
The colloquium will be held on the '''7th of April''' during the seminar.<br />
<br />
You may not use any materials during the colloquium except a single A4 sheet (both sides), prepared before the exam and handwritten personally by you. You will receive 2 questions from the [https://github.com/shestakoff/hse_se_ml/raw/master/2020/colloq/colloq-2020.pdf '''question list'''], with 15 minutes for preparation, and may be given additional questions or tasks.<br />
<br />
== Course Schedule (3rd module)==<br />
===Lectures===<br />
'''Mondays'''<br />
* 10:30-11:50, Room R205<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Metric-based methods. K-NN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l02-knn/lecture-knn.slides#/ Slides] <br/><br />
<br />
'''Lecture 3. Decision Trees''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l03-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
'''Lecture 4. Linear Regression''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l04-linreg/lecture-linreg.slides#/ Slides] <br/><br />
<br />
'''Lecture 5. Linear Classification''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l05-linclass/lecture-linclass.slides#/ Slides] <br/><br />
<br />
'''Lecture 6. Quality measures''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l06-metrics/lecture-metrics.slides#/ Slides], [https://youtu.be/kItcW-G0wzM record] <br/><br />
<br />
'''Lecture 7. Dimension Reduction. PCA''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l07-dimred/lecture-dimred.slides#/ Slides] <br/><br />
<br />
'''Lecture 8. NLP Introduction''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l08-nlp-intro/lecture-nlp-intro.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] '''Due Date: 28.01.2020 23:59'''<br/><br />
<br />
'''Seminar 2. Metric-based methods. K-NN'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s02-metric-based-methods%20 Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/seminar2-homework.ipynb Homework 2] '''Due Date: 04.02.2020 23:59'''<br/><br />
<br />
'''Seminar 3. Decision Trees'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s03-decision-trees Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees//seminar3-homework.ipynb Homework 3] '''Due Date: 01.03.2020 23:59'''<br/><br />
<br />
'''Seminar 4. Linear Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s04-linear-regression Practice in class] <br/><br />
<br />
'''Seminar 5. Logistic Regression'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-logreg.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-homework.ipynb Homework 4] '''Due Date: 22.03.2020 23:59'''<br/><br />
<br />
'''Seminar 6. Quality Measures'''<br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/seminar6-quality.ipynb Practice in class] <br/><br />
<br />
'''Seminar 7. Dimension Reduction'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-dimred.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-homework.ipynb Homework 5] '''Due Date: 12.04.2020 23:59'''<br/><br />
<br />
'''Seminar 8. Introduction to NLP'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s08-nlp Practice in class] <br/><br />
[https://www.kaggle.com/c/explicit-content-detection Kaggle 1] '''Due Date: 12.04.2020 23:59'''<br/><br />
<br />
== Theoretical questions for the colloquium ==<br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/knn_theory.pdf Metric-based methods. K-NN] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees/trees_theory.pdf Decision Trees] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s04-linear-regression/linreg_theory.pdf Linear Regression] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/linclass_theory.pdf Logistic Regression] <br/><br />
[https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/metrics_svm.pdf Quality Measures] <br/><br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students are assessed through their home assignments and exams. There are two exams: one after the 3rd module and one after the 4th. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10; grades corresponding to 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can earn students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
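As a sketch, this grading rule can be written out in Python. The intermediate thresholds elided by "..." in the list above are assumed to continue in 10-point steps from 35% up to 95%; the function name and example numbers are illustrative, not official:

```python
def final_grade(s_homework, s_exam1, s_exam2, s_competition=0.0):
    # Cumulative score; the optional competition term is a pure bonus,
    # which is why its weight pushes the sum of weights above 1.
    s_cumulative = 0.8 * s_homework + 0.2 * s_exam1 + 0.2 * s_competition
    score = 0.7 * s_cumulative + 0.3 * s_exam2
    # Thresholds 35%, 45%, ..., 95% map to grades 4..10 (10-point steps
    # assumed for the rows elided in the list above).
    grade = 0  # placeholder for an unsatisfactory result (1-3 in the text)
    for step, threshold in enumerate(range(35, 96, 10)):
        if score * 100 >= threshold:
            grade = 4 + step
    return grade

# A student with 90% of homework, 80% on both exams, no competition:
print(final_grade(0.9, 0.8, 0.8))  # prints 9
```

Because the competition is a bonus, a student who does everything perfectly can exceed a 100% score; the grade is simply capped by the top threshold.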
<br />
'''Kaggle competition 1'''<br/><br />
«Quality grade» = ("your quality" - "baseline method quality") / ("max achieved quality" - "baseline method quality") <br /><br />
'''Required condition:''' a notebook with your best solution must be reproducible.<br />
<br />
== Plagiarism ==<br />
If plagiarism is discovered, both works found to be identical will receive zero points for the home assignment. Repeated plagiarism by the same person will be reported to the dean's office.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will receive a zero score unless there is a legitimate reason for late submission; a high load in other classes is not considered a legitimate reason. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be submitted as Jupyter notebooks and theoretical ones as PDF files. Practical assignments must use '''Python 3''' (or Python 3 compatible code). Use your surname as the filename (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be written in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of the classic ML books: [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a brief guide to Python 2 with examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2020Data analysis (Software Engineering) 20202020-01-27T21:14:29Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTAyNjQ3ODk4MjczLWZkMDA5OWFiZWVmZjEzMWU2NThjZjc5MjEwZGM0NDBmYjEwOTlmYTI4ZWE0YmMxMjk0OTQxMTdlNjY0MWMyZTk Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking and collaborative filtering. We will also study the mathematical methods and concepts on which data analysis is based, the formal assumptions behind them, and various aspects of their implementation.<br />
<br />
Significant attention is given to practical data analysis skills, which will be developed in seminars through the Python programming language and the relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Course Schedule (3rd module)==<br />
===Lectures===<br />
'''Mondays'''<br />
* 10:30-11:50, Room R205<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Metric-based methods. K-NN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l02-knn/lecture-knn.slides#/ Slides] <br/><br />
<br />
<br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] '''Due Date: 28.01.2020 23:59'''<br/><br />
<br />
'''Seminar 2. Metric-based methods. K-NN'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s02-metric-based-methods%20 Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/seminar2-homework.ipynb Homework 2] '''Due Date: 04.02.2020 23:59'''<br/><br />
<br />
<br />
<br />
== Theoretical questions for the colloquium ==<br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/knn_theory.pdf Metric-based methods. K-NN] <br/><br />
<br />
<br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students are assessed through their home assignments and exams. There are two exams: one after the 3rd module and one after the 4th. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10; grades corresponding to 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can earn students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
<br />
== Plagiarism ==<br />
If plagiarism is discovered, both works found to be identical will receive zero points for the home assignment. Repeated plagiarism by the same person will be reported to the dean's office.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will receive a zero score unless there is a legitimate reason for late submission; a high load in other classes is not considered a legitimate reason. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be submitted as Jupyter notebooks and theoretical ones as PDF files. Practical assignments must use '''Python 3''' (or Python 3 compatible code). Use your surname as the filename (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be written in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of the classic ML books: [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a brief guide to Python 2 with examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2020Data analysis (Software Engineering) 20202020-01-27T21:13:34Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTAyNjQ3ODk4MjczLWZkMDA5OWFiZWVmZjEzMWU2NThjZjc5MjEwZGM0NDBmYjEwOTlmYTI4ZWE0YmMxMjk0OTQxMTdlNjY0MWMyZTk Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking and collaborative filtering. We will also study the mathematical methods and concepts on which data analysis is based, the formal assumptions behind them, and various aspects of their implementation.<br />
<br />
Significant attention is given to practical data analysis skills, which will be developed in seminars through the Python programming language and the relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Course Schedule (3rd module)==<br />
===Lectures===<br />
'''Mondays'''<br />
* 10:30-11:50, Room R205<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Metric-based methods. K-NN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l02-knn/lecture-knn.slides#/ Slides] <br/><br />
<br />
<br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] '''Due Date: 28.01.2020 23:59'''<br/><br />
<br />
'''Seminar 2. Metric-based methods. K-NN'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s02-metric-based-methods%20 Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/seminar2-knn.ipynb Homework 2] '''Due Date: 04.02.2020 23:59'''<br/><br />
<br />
<br />
<br />
== Theoretical questions for the colloquium ==<br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/knn_theory.pdf Metric-based methods. K-NN] <br/><br />
<br />
<br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students are assessed through their home assignments and exams. There are two exams: one after the 3rd module and one after the 4th. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10; grades corresponding to 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can earn students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
<br />
== Plagiarism ==<br />
If plagiarism is discovered, both works found to be identical will receive zero points for the home assignment. Repeated plagiarism by the same person will be reported to the dean's office.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will receive a zero score unless there is a legitimate reason for late submission; a high load in other classes is not considered a legitimate reason. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be submitted as Jupyter notebooks and theoretical ones as PDF files. Practical assignments must use '''Python 3''' (or Python 3 compatible code). Use your surname as the filename (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be written in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of the classic ML books: [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a brief guide to Python 2 with examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2020Data analysis (Software Engineering) 20202020-01-27T21:12:03Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTAyNjQ3ODk4MjczLWZkMDA5OWFiZWVmZjEzMWU2NThjZjc5MjEwZGM0NDBmYjEwOTlmYTI4ZWE0YmMxMjk0OTQxMTdlNjY0MWMyZTk Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking and collaborative filtering. We will also study the mathematical methods and concepts on which data analysis is based, the formal assumptions behind them, and various aspects of their implementation.<br />
<br />
Significant attention is given to practical data analysis skills, which will be developed in seminars through the Python programming language and the relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Course Schedule (3rd module)==<br />
===Lectures===<br />
'''Mondays'''<br />
* 10:30-11:50, Room R205<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Metric-based methods. K-NN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l02-knn/lecture-knn.slides#/ Slides] <br/><br />
<br />
<br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] '''Due Date: 28.01.2020 23:59'''<br/><br />
<br />
'''Seminar 2. Metric-based methods. K-NN'''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s02-metric-based-methods%20 Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/seminar2-knn.ipynb Homework 2] '''Due Date: 04.02.2020 23:59'''<br/><br />
<br />
<br />
<br />
== Theoretical questions for the colloquium ==<br />
<br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/knn_theory.pdf Metric-based methods. K-NN] <br/><br />
<br />
<br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students are assessed through their home assignments and exams. There are two exams: one after the 3rd module and one after the 4th. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10; grades corresponding to 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can earn students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
<br />
== Plagiarism ==<br />
If plagiarism is discovered, zero points will be given for the home assignment to both works found to be identical. Repeated plagiarism by the same person will be reported to the dean's office.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will not be scored (they receive zero points) unless there is a legitimate reason for late submission; a high workload in other classes is not considered a legitimate reason. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented in Jupyter Notebook format, theoretical ones in PDF. Practical assignments must use '''Python 3''' (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a brief guide with Python 2 examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2020Data analysis (Software Engineering) 20202020-01-20T18:45:29Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTAyNjQ3ODk4MjczLWZkMDA5OWFiZWVmZjEzMWU2NThjZjc5MjEwZGM0NDBmYjEwOTlmYTI4ZWE0YmMxMjk0OTQxMTdlNjY0MWMyZTk Slack Invite Link]''' <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]'''<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We will also study mathematical methods and concepts which data analysis is based on as well as formal assumptions behind them and various aspects of their implementation.<br />
<br />
Significant attention is given to practical data analysis skills, which will be developed in seminars by studying the Python programming language and relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Course Schedule (3rd module)==<br />
===Lectures===<br />
'''Mondays'''<br />
* 10:30-11:50, Room R205<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] <br/><br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students' knowledge is assessed through home assignments and exams. There are two exams: one after the 3rd module and one after the 4th. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
Grade takes values 4,5,…10. Grades, corresponding to 1,2,3 are assumed unsatisfactory. Exact grades are calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can earn students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
<br />
== Plagiarism ==<br />
If plagiarism is discovered, zero points will be given for the home assignment to both works found to be identical. Repeated plagiarism by the same person will be reported to the dean's office.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will not be scored (they receive zero points) unless there is a legitimate reason for late submission; a high workload in other classes is not considered a legitimate reason. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented in Jupyter Notebook format, theoretical ones in PDF. Practical assignments must use '''Python 3''' (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a brief guide with Python 2 examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2019Data analysis (Software Engineering) 20192019-06-03T08:52:22Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://docs.google.com/spreadsheets/d/1qKJtHeqXeTrDMlzxWXORiaTbUBM1QMd2DJjpA1eshm8/edit?usp=sharing Scores]''' <br /><br />
'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtNTc4NzUzODIwMjI0LTlhYTQxYmQxZmI5NTE4NDY0MjdlMWNjZTJhMzdlZDUzNmJhZWYyZmRkOTY0Zjc3NDE1OWMwOWEzOTdmNTI3YmE Slack Invite Link]''' <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2018 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]'''<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We will also study mathematical methods and concepts which data analysis is based on as well as formal assumptions behind them and various aspects of their implementation.<br />
<br />
Significant attention is given to practical data analysis skills, which will be developed in seminars by studying the Python programming language and relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Kaggle ==<br />
The link to the competition is in Slack.<br />
<br />
Reports must be submitted before June 14, 23:59 (the competition ends on the 13th of June). Reports should be sent [https://www.dropbox.com/request/YTUb8bJyVhMoTl6jVnGT here]. Please follow the format of the report template: https://github.com/shestakoff/hse_se_ml/blob/master/2019/kaggle/kaggle-report-template.ipynb<br />
<br />
== Colloquium ==<br />
The colloquium will be held on the '''1st and 2nd of April''' during the seminars and the lecture.<br />
<br />
You may not use any materials during the colloquium, except a single A4 sheet prepared before the exam and handwritten personally by you (both sides). You will receive 2 questions from the [https://cloud.mail.ru/public/Mmar/uHxTPtWnQ '''question list'''] with 20 minutes for preparation, and may be given additional questions or tasks.<br />
<br />
Time limits are strict, so please come to your own seminar or an earlier one.<br />
<br />
<br />
== Course Schedule (4th module)==<br />
===Seminars===<br />
'''Dates: Mondays (01.04, 08.04, 15.04, 22.04, 13.05, 20.05, 27.05, 03.06, 10.06)'''<br />
* Group BPI-161, 9:00-10:30, Room 501<br />
* Group BPI-162, 10:30-11:50, Room 311<br />
* Group BPI-163, 12:10-13:30, Room 311<br />
<br />
===Lectures===<br />
'''Dates: Tuesdays (02.04, 09.04, 16.04, 23.04, 14.05, 21.05, 28.05, 04.06)'''<br />
* 9:00-10:20, Room 317<br />
04.06 - Room 402<br />
<br />
[https://docs.google.com/spreadsheets/d/1pLN757-mq19G58qTs6wkdxNL9ZA_rik-eNi5M7AEFh4/edit#gid=2055150791 Complete Schedule of Software Engineering]<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l1-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Cross-validation. Metric-based models. KNN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l2-metric/lecture-knn.slides#/ Slides] <br/><br />
<br />
'''Lecture 3. Decision Trees ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l3-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
'''Lecture 4. Linear Regression, Gradient-based optimization ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l4-linreg/lecture-linreg.slides#/ Slides] <br/><br />
<br />
'''Lecture 5. Regularization, Linear Classification ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l5-linclass/lecture-linclass.slides#/ Slides] <br/><br />
<br />
'''Lecture 6. Supervised Quality Measures ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l6-quality/lecture-metrics.slides#/ Slides] <br/><br />
<br />
'''Lecture 7. Support Vector Machines. Kernel Trick ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l7-svm/lecture-svm.slides#/ Slides] <br/><br />
<br />
'''Lecture 8. Feature Selection. Dimension Reduction. PCA ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l8-pca/lecture-pca.slides#/ Slides] <br/><br />
<br />
'''Lecture 9. Ensembles ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l9-ensembles/lecture-ensemble.slides#/ Slides] <br/><br />
<br />
'''Lecture 10. Boosting ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l10-boosting/lecture-boosting.slides#/ Slides] <br/><br />
<br />
'''Lecture 11. Neural Networks 1 ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l11-nn1/lecture-nn1.slides#/ Slides] <br/><br />
<br />
'''Lecture 12. Neural Networks 2 ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l12-nn2/lecture-nn2.slides#/ Slides] <br/><br />
<br />
'''Lecture 13. Introduction to NLP ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l13-nlp-intro/lecture-nlp-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 14. Clustering ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l14-clust/lecture-clust.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2019/s1-intro Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s1-intro/seminar1-homework.ipynb Practical task 1], [https://www.dropbox.com/request/Ct57iiKQNfoU3CLw21UJ upload link], '''Due Date: 29.01.2019 23:59''' <br/><br />
Additional materials: [https://github.com/esokolov/ml-course-hse/blob/master/2016-fall/seminars/sem01-tools.ipynb 1], [https://drive.google.com/open?id=0B7TWwiIrcJstRzVRSlRFcEl3VGM 2]<br />
<br />
'''Seminar 2. Metric-based methods '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-knn.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/knn_theory.pdf Theoretical task 1] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-homework.ipynb Practical task 2] <br/><br />
<br />
'''Seminar 3. Decision Trees '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-trees.ipynb Practice in class], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees titanic.csv] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/trees_theory.pdf Theoretical task 2] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-homework.ipynb Practical task 3], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees data.csv], [https://www.dropbox.com/request/KYxV6H91zVqWfb73SqW4 upload link] '''Due Date: 19.02.2019 23:59''' <br/><br />
<br />
'''Seminar 4. Linear Regression '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/seminar4-linreg.ipynb Practice in class], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/dataset.csv dataset.csv] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/linreg_theory.pdf Theoretical task 3], [https://www.dropbox.com/request/zo7nHHiPt3qQSmxbwPZ4 upload link] '''Due Date: 24.02.2019 23:59''' <br/><br />
<br />
'''Seminar 5. Linear Classification '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/seminar5-logreg.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/seminar5-homework.ipynb Practical task 4], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/audit_data/audit_risk.csv audit_risk.csv], [https://www.dropbox.com/request/ZKvZFQm4RUnCLr4vxV2k upload link] '''Due Date: 10.03.2019 23:59''' <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/linclass_theory.pdf Theoretical task 4], [https://www.dropbox.com/request/BPCi4lU8DLGGmtHClXhp upload link] '''Due Date: 04.03.2019 23:59''' <br/><br />
<br />
'''Seminar 6. Supervised quality measures '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s6-quality/seminar6-quality.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s6-quality/metrics_svm.pdf Theoretical task 5], [https://www.dropbox.com/request/H9N6AjI13sowOWubZTdp upload link] '''Due Date: 25.03.2019 23:59''' <br/><br />
<br />
'''Seminar 8. Feature Selection. Dimension Reduction. PCA '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/seminar8-pca.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/seminar8-homework.ipynb Practical task 5], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/data/voice.csv voice.csv] [https://www.dropbox.com/request/cobhQeuESjkVtzYbg4Yl upload link] '''Due Date: 21.04.2019 23:59''' <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/pca_theory.pdf Theoretical task 6], [https://www.dropbox.com/request/ZaSfptNsUuWyB27fvsXG upload link] '''Due Date: 15.04.2019 23:59''' <br/><br />
<br />
'''Seminar 9. Ensembles '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s9-ensembles/seminar9-ensembles.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s9-ensembles/ensembles_theory.pdf Theoretical task 7] [https://www.dropbox.com/request/WxpJZAZGx9eIo5HVMbBR upload link] '''Due Date: 25.04.2019 23:59''' <br/><br />
<br />
'''Seminar 10. Boosting '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s10-boosting/seminar.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s10-boosting/boosting-theory.pdf Theoretical task 8] [https://www.dropbox.com/request/yDTwmOcDu3jD14CAtlx7 upload link] '''Due Date: 30.04.2019 23:59''' <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s10-boosting/seminar10-homework.ipynb Practical task 6], [https://www.dropbox.com/request/o1JBiKFMGj2Bv6vOqtJ8 upload link] '''Extended Due Date: 17.05.2019 23:59''' <br/><br />
<br />
'''Seminar 11. Neural Networks 1 '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s11-nn1/seminar11-nn1.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s11-nn1/TT9.pdf Theoretical task 9] [https://www.dropbox.com/request/jAwIMNTOaCKdN07q5qTm upload link] '''Due Date: 21.05.2019 23:59''' <br/><br />
<br />
'''Seminar 12. Neural Networks 2 '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s12-nn2/seminar12-nn2.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s12-nn2/seminar12-homework.ipynb Practical task 7] [https://www.dropbox.com/request/iehLVStpn30RchYr5nMC upload link] '''Due Date: 07.06.2019 23:59''' <br/><br />
<br />
'''Seminar 13. Intro to Kaggle and NLP '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s13-nlp-intro/seminar13-nlp-intro.ipynb Practice in class]<br/><br />
<br />
'''Seminar 14. Clustering '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s14-clustering/seminar14-clustering.ipynb Practice in class]<br/><br />
<br />
<br/><br />
<br />
'''To ease the grading process for our course assistants, please put your subgroup number at the beginning of your solution filenames''' <br/><br />
Example: 165-1-shestakov-andrey.ipynb <br/><br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students' knowledge is assessed through home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams: one after the 3rd module and one after the 4th. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10; grades 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can earn students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
<br />
== Plagiarism ==<br />
If plagiarism is discovered, zero points will be given for the home assignment to both works found to be identical. Repeated plagiarism by the same person will be reported to the dean's office.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will not be scored (they receive zero points) unless there is a legitimate reason for late submission; a high workload in other classes is not considered a legitimate reason. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented in Jupyter Notebook format, theoretical ones in PDF. Practical assignments must use '''Python 3''' (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a brief guide with Python 2 examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2019Data analysis (Software Engineering) 20192019-04-14T18:43:49Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://docs.google.com/spreadsheets/d/1qKJtHeqXeTrDMlzxWXORiaTbUBM1QMd2DJjpA1eshm8/edit?usp=sharing Scores]''' <br /><br />
'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtNTc4NzUzODIwMjI0LTlhYTQxYmQxZmI5NTE4NDY0MjdlMWNjZTJhMzdlZDUzNmJhZWYyZmRkOTY0Zjc3NDE1OWMwOWEzOTdmNTI3YmE Slack Invite Link]''' <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2018 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]'''<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We will also study mathematical methods and concepts which data analysis is based on as well as formal assumptions behind them and various aspects of their implementation.<br />
<br />
Significant attention is given to practical data analysis skills, which will be developed in seminars by studying the Python programming language and relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Colloquium ==<br />
The colloquium will be held on the '''1st and 2nd of April''' during the seminars and the lecture.<br />
<br />
You may not use any materials during the colloquium, except a single A4 sheet prepared before the exam and handwritten personally by you (both sides). You will receive 2 questions from the [https://cloud.mail.ru/public/Mmar/uHxTPtWnQ '''question list'''] with 20 minutes for preparation, and may be given additional questions or tasks.<br />
<br />
Time limits are strict, so please come to your own seminar or an earlier one.<br />
<br />
<br />
== Course Schedule (3rd module)==<br />
===Seminars===<br />
'''Dates: Mondays (01.04, 08.04, 15.04, 22.04, 13.05, 20.05, 27.05, 10.06)'''<br />
* Group BPI-161, 9:00-10:30, Room 501<br />
* Group BPI-162, 10:30-11:50, Room 311<br />
* Group BPI-163, 12:10-13:30, Room 311<br />
<br />
===Lectures===<br />
'''Dates: Tuesdays (02.04, 09.04, 16.04, 23.04, 14.05, 21.05, 28.05, 11.06)'''<br />
* 9:00-10:20, Room 317<br />
<br />
[https://docs.google.com/spreadsheets/d/1pLN757-mq19G58qTs6wkdxNL9ZA_rik-eNi5M7AEFh4/edit#gid=2055150791 Complete Schedule of Software Engineering]<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l1-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Cross-validation. Metric-based models. KNN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l2-metric/lecture-knn.slides#/ Slides] <br/><br />
<br />
'''Lecture 3. Decision Trees ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l3-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
'''Lecture 4. Linear Regression, Gradient-based optimization ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l4-linreg/lecture-linreg.slides#/ Slides] <br/><br />
<br />
'''Lecture 5. Regularization, Linear Classification ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l5-linclass/lecture-linclass.slides#/ Slides] <br/><br />
<br />
'''Lecture 6. Supervised Quality Measures ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l6-quality/lecture-metrics.slides#/ Slides] <br/><br />
<br />
'''Lecture 7. Support Vector Machines. Kernel Trick ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l7-svm/lecture-svm.slides#/ Slides] <br/><br />
<br />
'''Lecture 8. Feature Selection. Dimension Reduction. PCA ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l8-pca/lecture-pca.slides#/ Slides] <br/><br />
<br />
'''Lecture 9. Ensembles ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l9-ensembles/lecture-ensemble.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2019/s1-intro Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s1-intro/seminar1-homework.ipynb Practical task 1], [https://www.dropbox.com/request/Ct57iiKQNfoU3CLw21UJ upload link], '''Due Date: 29.01.2019 23:59''' <br/><br />
Additional materials: [https://github.com/esokolov/ml-course-hse/blob/master/2016-fall/seminars/sem01-tools.ipynb 1], [https://drive.google.com/open?id=0B7TWwiIrcJstRzVRSlRFcEl3VGM 2]<br />
<br />
'''Seminar 2. Metric-based methods '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-knn.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/knn_theory.pdf Theoretical task 1] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-homework.ipynb Practical task 2] <br/><br />
<br />
'''Seminar 3. Decision Trees '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-trees.ipynb Practice in class], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees titanic.csv] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/trees_theory.pdf Theoretical task 2] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-homework.ipynb Practical task 3], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees data.csv], [https://www.dropbox.com/request/KYxV6H91zVqWfb73SqW4 upload link] '''Due Date: 19.02.2019 23:59''' <br/><br />
<br />
'''Seminar 4. Linear Regression '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/seminar4-linreg.ipynb Practice in class], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/dataset.csv dataset.csv] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/linreg_theory.pdf Theoretical task 3], [https://www.dropbox.com/request/zo7nHHiPt3qQSmxbwPZ4 upload link] '''Due Date: 24.02.2019 23:59''' <br/><br />
<br />
'''Seminar 5. Linear Classification '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/seminar5-logreg.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/seminar5-homework.ipynb Practical task 4], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/audit_data/audit_risk.csv audit_risk.csv], [https://www.dropbox.com/request/ZKvZFQm4RUnCLr4vxV2k upload link] '''Due Date: 10.03.2019 23:59''' <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/linclass_theory.pdf Theoretical task 4], [https://www.dropbox.com/request/BPCi4lU8DLGGmtHClXhp upload link] '''Due Date: 04.03.2019 23:59''' <br/><br />
<br />
'''Seminar 6. Supervised quality measures '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s6-quality/seminar6-quality.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s6-quality/metrics_svm.pdf Theoretical task 5], [https://www.dropbox.com/request/H9N6AjI13sowOWubZTdp upload link] '''Due Date: 25.03.2019 23:59''' <br/><br />
<br />
'''Seminar 8. Feature Selection. Dimension Reduction. PCA '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/seminar8-pca.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/seminar8-homework.ipynb Practical task 5], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/data/voice.csv voice.csv] [https://www.dropbox.com/request/cobhQeuESjkVtzYbg4Yl upload link] '''Due Date: 21.04.2019 23:59''' <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/pca_theory.pdf Theoretical task 6], [https://www.dropbox.com/request/ZaSfptNsUuWyB27fvsXG upload link] '''Due Date: 15.04.2019 23:59''' <br/><br />
<br />
'''Seminar 9. Ensembles '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s9-ensembles/seminar9-ensembles.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s9-ensembles/ensembles_theory.pdf Theoretical task 7] <br/><br />
<br />
<br/><br />
<br />
<br/><br />
<br />
'''To ease the grading process for our course assistants, please put your subgroup number at the beginning of solution filenames''' <br/><br />
Example: 165-1-shestakov-andrey.ipynb <br/><br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students are assessed through home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams during the course – one after the 3rd module and one after the 4th module. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10; grades 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' (S<sub>cumulative</sub>) = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can earn students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
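Since the cumulative-score weights intentionally sum to more than 1 (the competition is a bonus), the rule above can be sketched in Python. This is only an illustrative reading of the rules, not an official calculator: the function and variable names are mine, and capping the cumulative score at 1.0 before applying the final formula is my assumption.

```python
def final_score(homework, exam1, exam2, competition=0.0):
    # All inputs are proportions in [0, 1]. The competition term is an
    # optional bonus, so the cumulative weights sum to 1.2; capping at 1.0
    # is an assumption, not stated explicitly in the course rules.
    cumulative = 0.8 * homework + 0.2 * exam1 + 0.2 * competition
    return 0.7 * min(cumulative, 1.0) + 0.3 * exam2

def grade(score):
    # score >= 35% -> 4, >= 45% -> 5, ..., >= 95% -> 10; below 35% fails.
    for g in range(10, 3, -1):
        if score >= (g - 10) * 0.10 + 0.95:
            return g
    return 3  # unsatisfactory
```

For example, a student with 80% of homework, 70% on each exam and no competition bonus gets a cumulative score of 0.78 and a final score of about 0.756, i.e. grade 8.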
<br />
== Plagiarism ==<br />
If plagiarism is discovered, zero points will be assigned for the home assignment – to both works found to be identical. Repeated plagiarism by the same person will be reported to the dean.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will not be scored (assigned a zero score) in the absence of a legitimate reason for late submission; a high load in other classes does not count as one. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented as Jupyter notebooks, theoretical ones submitted as PDF files. Practical assignments must use '''Python 3''' (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short introduction for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a brief guide with examples in Python 2]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2019Data analysis (Software Engineering) 20192019-04-07T16:24:51Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://docs.google.com/spreadsheets/d/1qKJtHeqXeTrDMlzxWXORiaTbUBM1QMd2DJjpA1eshm8/edit?usp=sharing Scores]''' <br /><br />
'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtNTc4NzUzODIwMjI0LTlhYTQxYmQxZmI5NTE4NDY0MjdlMWNjZTJhMzdlZDUzNmJhZWYyZmRkOTY0Zjc3NDE1OWMwOWEzOTdmNTI3YmE Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2018 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking and collaborative filtering. We will also study the mathematical methods and concepts on which data analysis is based, as well as the formal assumptions behind them and various aspects of their implementation.<br />
<br />
Significant attention is given to practical data analysis skills, which will be developed in seminars by studying the Python programming language and the relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Colloquium ==<br />
The colloquium will be held on the '''1st and 2nd of April''' during the seminars and the lecture.<br />
<br />
You may not use any materials during the colloquium, except for a single A4 sheet prepared before the exam and handwritten personally by you (on both sides). You will get 2 questions from the [https://cloud.mail.ru/public/Mmar/uHxTPtWnQ '''question list'''] with 20 minutes for preparation, and you may receive additional questions or tasks.<br />
<br />
Time limits are strict, so come to your own seminar or an earlier one.<br />
<br />
<br />
== Course Schedule (3rd module)==<br />
===Seminars===<br />
'''Dates: Mondays (01.04, 08.04, 15.04, 22.04, 13.05, 20.05, 27.05, 10.06)'''<br />
* Group BPI-161, 9:00-10:30, Room 501<br />
* Group BPI-162, 10:30-11:50, Room 311<br />
* Group BPI-163, 12:10-13:30, Room 311<br />
<br />
===Lectures===<br />
'''Dates: Tuesdays (02.04, 09.04, 16.04, 23.04, 14.05, 21.05, 28.05, 11.06)'''<br />
* 9:00-10:20, Room 317<br />
<br />
[https://docs.google.com/spreadsheets/d/1pLN757-mq19G58qTs6wkdxNL9ZA_rik-eNi5M7AEFh4/edit#gid=2055150791 Complete Schedule of Software Engineering]<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l1-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Cross-validation. Metric-based models. KNN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l2-metric/lecture-knn.slides#/ Slides] <br/><br />
<br />
'''Lecture 3. Decision Trees ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l3-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
'''Lecture 4. Linear Regression, Gradient-based optimization ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l4-linreg/lecture-linreg.slides#/ Slides] <br/><br />
<br />
'''Lecture 5. Regularization, Linear Classification ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l5-linclass/lecture-linclass.slides#/ Slides] <br/><br />
<br />
'''Lecture 6. Supervised Quality Measures ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l6-quality/lecture-metrics.slides#/ Slides] <br/><br />
<br />
'''Lecture 7. Support Vector Machines. Kernel Trick ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l7-svm/lecture-svm.slides#/ Slides] <br/><br />
<br />
'''Lecture 8. Feature Selection. Dimension Reduction. PCA ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l8-pca/lecture-pca.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2019/s1-intro Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s1-intro/seminar1-homework.ipynb Practical task 1], [https://www.dropbox.com/request/Ct57iiKQNfoU3CLw21UJ upload link], '''Due Date: 29.01.2019 23:59''' <br/><br />
Additional materials: [https://github.com/esokolov/ml-course-hse/blob/master/2016-fall/seminars/sem01-tools.ipynb 1], [https://drive.google.com/open?id=0B7TWwiIrcJstRzVRSlRFcEl3VGM 2]<br />
<br />
'''Seminar 2. Metric-based methods '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-knn.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/knn_theory.pdf Theoretical task 1] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-homework.ipynb Practical task 2] <br/><br />
<br />
'''Seminar 3. Decision Trees '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-trees.ipynb Practice in class], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees titanic.csv] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/trees_theory.pdf Theoretical task 2] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-homework.ipynb Practical task 3], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees data.csv], [https://www.dropbox.com/request/KYxV6H91zVqWfb73SqW4 upload link] '''Due Date: 19.02.2019 23:59''' <br/><br />
<br />
'''Seminar 4. Linear Regression '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/seminar4-linreg.ipynb Practice in class], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/dataset.csv dataset.csv] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/linreg_theory.pdf Theoretical task 3], [https://www.dropbox.com/request/zo7nHHiPt3qQSmxbwPZ4 upload link] '''Due Date: 24.02.2019 23:59''' <br/><br />
<br />
'''Seminar 5. Linear Classification '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/seminar5-logreg.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/seminar5-homework.ipynb Practical task 4], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/audit_data/audit_risk.csv audit_risk.csv], [https://www.dropbox.com/request/ZKvZFQm4RUnCLr4vxV2k upload link] '''Due Date: 10.03.2019 23:59''' <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/linclass_theory.pdf Theoretical task 4], [https://www.dropbox.com/request/BPCi4lU8DLGGmtHClXhp upload link] '''Due Date: 04.03.2019 23:59''' <br/><br />
<br />
'''Seminar 6. Supervised quality measures '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s6-quality/seminar6-quality.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s6-quality/metrics_svm.pdf Theoretical task 5], [https://www.dropbox.com/request/H9N6AjI13sowOWubZTdp upload link] '''Due Date: 25.03.2019 23:59''' <br/><br />
<br />
'''Seminar 8. Feature Selection. Dimension Reduction. PCA '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/seminar8-pca.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/seminar8-homework.ipynb Practical task 5], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/data/voice.csv voice.csv] <br/><br />
<br />
<br/><br />
<br />
<br/><br />
<br />
'''To ease the grading process for our course assistants, please put your subgroup number at the beginning of solution filenames''' <br/><br />
Example: 165-1-shestakov-andrey.ipynb <br/><br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students are assessed through home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams during the course – one after the 3rd module and one after the 4th module. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10; grades 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' (S<sub>cumulative</sub>) = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can earn students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
<br />
== Plagiarism ==<br />
If plagiarism is discovered, zero points will be assigned for the home assignment – to both works found to be identical. Repeated plagiarism by the same person will be reported to the dean.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will not be scored (assigned a zero score) in the absence of a legitimate reason for late submission; a high load in other classes does not count as one. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented as Jupyter notebooks, theoretical ones submitted as PDF files. Practical assignments must use '''Python 3''' (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short introduction for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d a brief guide with examples in Python 2]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2019Data analysis (Software Engineering) 20192019-04-07T16:24:19Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://docs.google.com/spreadsheets/d/1qKJtHeqXeTrDMlzxWXORiaTbUBM1QMd2DJjpA1eshm8/edit?usp=sharing Scores]''' <br /><br />
'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtNTc4NzUzODIwMjI0LTlhYTQxYmQxZmI5NTE4NDY0MjdlMWNjZTJhMzdlZDUzNmJhZWYyZmRkOTY0Zjc3NDE1OWMwOWEzOTdmNTI3YmE Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2018 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking and collaborative filtering. We will also study the mathematical methods and concepts on which data analysis is based, as well as the formal assumptions behind them and various aspects of their implementation.<br />
<br />
Significant attention is given to practical data analysis skills, which will be developed in seminars by studying the Python programming language and the relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Colloquium ==<br />
The colloquium will be held on the '''1st and 2nd of April''' during the seminars and the lecture.<br />
<br />
You may not use any materials during the colloquium, except for a single A4 sheet prepared before the exam and handwritten personally by you (on both sides). You will get 2 questions from the [https://cloud.mail.ru/public/Mmar/uHxTPtWnQ '''question list'''] with 20 minutes for preparation, and you may receive additional questions or tasks.<br />
<br />
Time limits are strict, so come to your own seminar or an earlier one.<br />
<br />
<br />
== Course Schedule (3rd module)==<br />
===Seminars===<br />
'''Dates: Mondays (01.04, 08.04, 15.04, 22.04, 13.05, 20.05, 27.05, 10.06)'''<br />
* Group BPI-161, 9:00-10:30, Room 501<br />
* Group BPI-162, 10:30-11:50, Room 311<br />
* Group BPI-163, 12:10-13:30, Room 311<br />
<br />
===Lectures===<br />
'''Dates: Tuesdays (02.04, 09.04, 16.04, 23.04, 14.05, 21.05, 28.05, 11.06)'''<br />
* 9:00-10:20, Room 317<br />
<br />
[https://docs.google.com/spreadsheets/d/1pLN757-mq19G58qTs6wkdxNL9ZA_rik-eNi5M7AEFh4/edit#gid=2055150791 Complete Schedule of Software Engineering]<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l1-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
'''Lecture 2. Cross-validation. Metric-based models. KNN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l2-metric/lecture-knn.slides#/ Slides] <br/><br />
<br />
'''Lecture 3. Decision Trees ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l3-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
'''Lecture 4. Linear Regression, Gradient-based optimization ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l4-linreg/lecture-linreg.slides#/ Slides] <br/><br />
<br />
'''Lecture 5. Regularization, Linear Classification ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l5-linclass/lecture-linclass.slides#/ Slides] <br/><br />
<br />
'''Lecture 6. Supervised Quality Measures ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l6-quality/lecture-metrics.slides#/ Slides] <br/><br />
<br />
'''Lecture 7. Support Vector Machines. Kernel Trick ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l7-svm/lecture-svm.slides#/ Slides] <br/><br />
<br />
'''Lecture 8. Feature Selection. Dimension Reduction. PCA ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l8-pca/lecture-pca.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2019/s1-intro Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s1-intro/seminar1-homework.ipynb Practical task 1], [https://www.dropbox.com/request/Ct57iiKQNfoU3CLw21UJ upload link], '''Due Date: 29.01.2019 23:59''' <br/><br />
Additional materials: [https://github.com/esokolov/ml-course-hse/blob/master/2016-fall/seminars/sem01-tools.ipynb 1], [https://drive.google.com/open?id=0B7TWwiIrcJstRzVRSlRFcEl3VGM 2]<br />
<br />
'''Seminar 2. Metric-based methods '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-knn.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/knn_theory.pdf Theoretical task 1] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-homework.ipynb Practical task 2] <br/><br />
<br />
'''Seminar 3. Decision Trees '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-trees.ipynb Practice in class], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees titanic.csv] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/trees_theory.pdf Theoretical task 2] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-homework.ipynb Practical task 3], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees data.csv], [https://www.dropbox.com/request/KYxV6H91zVqWfb73SqW4 upload link] '''Due Date: 19.02.2019 23:59''' <br/><br />
<br />
'''Seminar 4. Linear Regression '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/seminar4-linreg.ipynb Practice in class], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/dataset.csv dataset.csv] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s4-linreg/linreg_theory.pdf Theoretical task 3], [https://www.dropbox.com/request/zo7nHHiPt3qQSmxbwPZ4 upload link] '''Due Date: 24.02.2019 23:59''' <br/><br />
<br />
'''Seminar 5. Linear Classification '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/seminar5-logreg.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/seminar5-homework.ipynb Practical task 4], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/audit_data/audit_risk.csv audit_risk.csv], [https://www.dropbox.com/request/ZKvZFQm4RUnCLr4vxV2k upload link] '''Due Date: 10.03.2019 23:59''' <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s5-logreg/linclass_theory.pdf Theoretical task 4], [https://www.dropbox.com/request/BPCi4lU8DLGGmtHClXhp upload link] '''Due Date: 04.03.2019 23:59''' <br/><br />
<br />
'''Seminar 6. Supervised quality measures '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s6-quality/seminar6-quality.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s6-quality/metrics_svm.pdf Theoretical task 5], [https://www.dropbox.com/request/H9N6AjI13sowOWubZTdp upload link] '''Due Date: 25.03.2019 23:59''' <br/><br />
<br />
'''Seminar 8. Feature Selection. Dimension Reduction. PCA. '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/seminar8-pca.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/seminar8-homework.ipynb Practical task 5], [https://github.com/shestakoff/hse_se_ml/blob/master/2019/s8-pca/data/voice.csv voice.csv] <br/><br />
<br />
<br/><br />
<br />
<br/><br />
<br />
'''To ease the grading process for our course assistants, please put your subgroup number at the beginning of solution filenames''' <br/><br />
Example: 165-1-shestakov-andrey.ipynb <br/><br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students are assessed through home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams during the course – one after the 3rd module and one after the 4th module. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10; grades 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' (S<sub>cumulative</sub>) = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can earn students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
<br />
== Plagiarism ==<br />
If plagiarism is discovered, zero points will be assigned for the home assignment – to both works found to be identical. Repeated plagiarism by the same person will be reported to the dean.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will not be scored (assigned a zero score) in the absence of a legitimate reason for late submission; a high load in other classes does not count as one. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be implemented as Jupyter notebooks, theoretical ones submitted as PDF files. Practical assignments must use '''Python 3''' (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning, Stats, Maths ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* [https://mml-book.github.io/ Math for ML]<br />
* One of classic ML books. [https://web.stanford.edu/~hastie/Papers/ESLII.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
* [http://immersivemath.com/ila/learnmore.html Linear Algebra Immersive book]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d A brief guide to Python 2 with examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2019Data analysis (Software Engineering) 20192019-02-03T18:07:49Z<p>Mhushchyn: </p>
<hr />
<div>'''[https://docs.google.com/spreadsheets/d/1qKJtHeqXeTrDMlzxWXORiaTbUBM1QMd2DJjpA1eshm8/edit?usp=sharing Scores]''' <br /><br />
'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtNTMxODU4MTMwNjQyLTFlM2FkMmViNGI2YTg2ZmRjMjU5ZTEyMmFlYmU2NGVjN2U2YTAzNjAwZjhlYmUwNGFjNjJmYmY5MWVjNTNmZmY Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2018 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, and collaborative filtering. We will also study the mathematical methods and concepts that data analysis is based on, the formal assumptions behind them, and various aspects of their implementation.<br />
<br />
Significant attention is given to the practical skills of data analysis, which will be developed in seminars by studying the Python programming language and the relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Course Schedule (3rd module)==<br />
===Seminars===<br />
'''Dates: Mondays (21.01, 28.01, 04.02, 11.02, 18.02, 25.02, 04.03, 11.03)'''<br />
* Group BPI-161, 10:30-11:50, Room 507<br />
* Group BPI-162, 12:10-13:30, Room 311<br />
* Group BPI-163, 13:40-15:00, Room 435<br />
<br />
===Lectures===<br />
'''Dates: Tuesdays (15.01, 22.01, 29.01, 05.02, 12.02, 19.02, 12.03, 19.03)'''<br />
* 9:00-10:20, Room 317<br />
<br />
[https://docs.google.com/spreadsheets/d/1pLN757-mq19G58qTs6wkdxNL9ZA_rik-eNi5M7AEFh4/edit#gid=1685685766 Complete Schedule of Software Engineering]<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l1-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
<br />
'''Lecture 2. Cross-validation. Metric-based models. KNN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l2-metric/lecture-knn.slides#/ Slides] <br/><br />
<br />
<br />
'''Lecture 3. Decision Trees ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l3-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2019/s1-intro Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s1-intro/seminar1-homework.ipynb Practical task 1], [https://www.dropbox.com/request/Ct57iiKQNfoU3CLw21UJ upload link], '''Due Date: 29.01.2019 23:59''' <br/><br />
Additional materials: [https://github.com/esokolov/ml-course-hse/blob/master/2016-fall/seminars/sem01-tools.ipynb 1], [https://drive.google.com/open?id=0B7TWwiIrcJstRzVRSlRFcEl3VGM 2]<br />
<br />
'''Seminar 2. Metric-based methods '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-knn.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/knn_theory.pdf Theoretical task 1], [https://www.dropbox.com/request/6vToJtGIZPcP6ZyHwyc8 upload link], '''Due Date: 04.02.2019 23:59''' <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-homework.ipynb Practical task 2], [https://www.dropbox.com/request/hauCnxFgVCUtjnV4VXdY upload link], '''Due Date: 05.02.2019 23:59''' <br/><br />
<br />
'''Seminar 3. Decision Trees '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-trees.ipynb Practice in class], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees titanic.csv] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-homework.ipynb Practical task 3], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees data.csv], '''Due Date: 12.02.2019 23:59''' <br/><br />
<br />
<br/><br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students' knowledge is assessed through home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams during the course – after the 3rd module and after the 4th module respectively. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10. Grades 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can give students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
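The grading rule above can be sketched in Python. This is an illustrative reading of the formulas, not an official course script: the function names are ours, and the competition term is an optional bonus, which is why the cumulative weights intentionally sum to more than 1.

```python
# Hedged sketch of the grading rule on this page; names are illustrative.

def cumulative_score(homework, exam1, competition=0.0):
    # All inputs are proportions in [0, 1]; competition is an optional bonus.
    return 0.8 * homework + 0.2 * exam1 + 0.2 * competition

def final_score(cumulative, exam2):
    return 0.7 * cumulative + 0.3 * exam2

def grade(score):
    # score >= 35% -> 4, >= 45% -> 5, ..., >= 95% -> 10;
    # anything below 35% is unsatisfactory (reported here as 3).
    result = 3
    for step, threshold in enumerate([0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]):
        if score >= threshold:
            result = 4 + step
    return result

s = final_score(cumulative_score(homework=0.9, exam1=0.8), exam2=0.7)
print(round(s, 3), grade(s))  # 0.826 8
```

For example, solving 90% of the homework, answering 80% of the first exam and 70% of the second, with no competition bonus, gives a score of about 0.826 and a grade of 8.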
<br />
== Plagiarism ==<br />
If plagiarism is discovered, both home assignments found to be identical will receive zero points. Repeated plagiarism by the same person will be reported to the dean.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will receive a zero score unless there is a legitimate reason for the late submission; a high workload in other classes is not a legitimate reason. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be submitted as Jupyter notebooks, theoretical ones as PDF files. Practical assignments must use '''Python 3''' (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* One of the classic ML books: [http://web.stanford.edu/~hastie/local.ftp/Springer/ESLII_print10.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d A brief guide to Python 2 with examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2019Data analysis (Software Engineering) 20192019-02-03T18:04:40Z<p>Mhushchyn: Seminar 3</p>
<hr />
<div>'''[https://docs.google.com/spreadsheets/d/1qKJtHeqXeTrDMlzxWXORiaTbUBM1QMd2DJjpA1eshm8/edit?usp=sharing Scores]''' <br /><br />
'''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtNTMxODU4MTMwNjQyLTFlM2FkMmViNGI2YTg2ZmRjMjU5ZTEyMmFlYmU2NGVjN2U2YTAzNjAwZjhlYmUwNGFjNjJmYmY5MWVjNTNmZmY Slack Invite Link] <br /><br />
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''[[ Data_analysis_(Software_Engineering)_2018 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/><br />
<br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, and collaborative filtering. We will also study the mathematical methods and concepts that data analysis is based on, the formal assumptions behind them, and various aspects of their implementation.<br />
<br />
Significant attention is given to the practical skills of data analysis, which will be developed in seminars by studying the Python programming language and the relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Course Schedule (3rd module)==<br />
===Seminars===<br />
'''Dates: Mondays (21.01, 28.01, 04.02, 11.02, 18.02, 25.02, 04.03, 11.03)'''<br />
* Group BPI-161, 10:30-11:50, Room 507<br />
* Group BPI-162, 12:10-13:30, Room 311<br />
* Group BPI-163, 13:40-15:00, Room 435<br />
<br />
===Lectures===<br />
'''Dates: Tuesdays (15.01, 22.01, 29.01, 05.02, 12.02, 19.02, 12.03, 19.03)'''<br />
* 9:00-10:20, Room 317<br />
<br />
[https://docs.google.com/spreadsheets/d/1pLN757-mq19G58qTs6wkdxNL9ZA_rik-eNi5M7AEFh4/edit#gid=1685685766 Complete Schedule of Software Engineering]<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l1-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
<br />
'''Lecture 2. Cross-validation. Metric-based models. KNN ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l2-metric/lecture-knn.slides#/ Slides] <br/><br />
<br />
<br />
'''Lecture 3. Decision Trees ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l3-trees/lecture-trees.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2019/s1-intro Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s1-intro/seminar1-homework.ipynb Practical task 1], [https://www.dropbox.com/request/Ct57iiKQNfoU3CLw21UJ upload link], '''Due Date: 29.01.2019 23:59''' <br/><br />
Additional materials: [https://github.com/esokolov/ml-course-hse/blob/master/2016-fall/seminars/sem01-tools.ipynb 1], [https://drive.google.com/open?id=0B7TWwiIrcJstRzVRSlRFcEl3VGM 2]<br />
<br />
'''Seminar 2. Metric-based methods '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-knn.ipynb Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/knn_theory.pdf Theoretical task 1], [https://www.dropbox.com/request/6vToJtGIZPcP6ZyHwyc8 upload link], '''Due Date: 04.02.2019 23:59''' <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s2-metric/seminar2-homework.ipynb Practical task 2], [https://www.dropbox.com/request/hauCnxFgVCUtjnV4VXdY upload link], '''Due Date: 05.02.2019 23:59''' <br/><br />
<br />
'''Seminar 3. Decision Trees '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-trees.ipynb Practice in class], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees titanic.csv] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s3-trees/seminar3-homework.ipynb Practical task 3], [https://github.com/shestakoff/hse_se_ml/tree/master/2019/s3-trees data.csv] <br/><br />
<br />
<br/><br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students' knowledge is assessed through home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams during the course – after the 3rd module and after the 4th module respectively. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10. Grades 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can give students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
<br />
== Plagiarism ==<br />
If plagiarism is discovered, both home assignments found to be identical will receive zero points. Repeated plagiarism by the same person will be reported to the dean.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will receive a zero score unless there is a legitimate reason for the late submission; a high workload in other classes is not a legitimate reason. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be submitted as Jupyter notebooks, theoretical ones as PDF files. Practical assignments must use '''Python 3''' (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* One of the classic ML books: [http://web.stanford.edu/~hastie/local.ftp/Springer/ESLII_print10.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d A brief guide to Python 2 with examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchynhttp://wiki.cs.hse.ru/Data_analysis_(Software_Engineering)_2019Data analysis (Software Engineering) 20192019-01-20T18:12:48Z<p>Mhushchyn: Материалы по семинару 1</p>
<hr />
<div>'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /><br />
'''Scores (coming soon)''' <br /><br />
'''[[ Data_analysis_(Software_Engineering)_2018 | Previous Course Page ]]''' <br /><br />
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br /><br />
<br />
<br /><br />
<br />
== Course description ==<br />
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, and collaborative filtering. We will also study the mathematical methods and concepts that data analysis is based on, the formal assumptions behind them, and various aspects of their implementation.<br />
<br />
Significant attention is given to the practical skills of data analysis, which will be developed in seminars by studying the Python programming language and the relevant libraries for scientific computing.<br />
<br />
Knowledge of linear algebra, real analysis and probability theory is required.<br />
<br />
'''The class consists of:'''<br />
# Lectures and seminars<br />
# Practical and theoretical homework assignments<br />
# A machine learning competition (more information will be available later)<br />
# Midterm theoretical colloquium<br />
# Final exam<br />
<br />
== Lecture materials ==<br />
<br />
'''Lecture 1. Introduction to data science and machine learning ''' <br/><br />
[https://shestakoff.github.io/hse_se_ml/2019/l1-intro/lecture-intro.slides#/ Slides] <br/><br />
<br />
== Seminars ==<br />
<br />
'''Seminar 1. Introduction to Data Analysis in Python '''<br/><br />
[https://github.com/shestakoff/hse_se_ml/tree/master/2019/s1-intro Practice in class] <br/><br />
[https://github.com/shestakoff/hse_se_ml/blob/master/2019/s1-intro/seminar1-homework.ipynb Practical task 1] <br/><br />
Additional materials: [https://github.com/esokolov/ml-course-hse/blob/master/2016-fall/seminars/sem01-tools.ipynb 1], [https://drive.google.com/open?id=0B7TWwiIrcJstRzVRSlRFcEl3VGM 2]<br />
<br />
<br />
<br/><br />
<br />
== Evaluation criteria ==<br />
The course runs during the 3rd and 4th modules. Students' knowledge is assessed through home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams during the course – after the 3rd module and after the 4th module respectively. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.<br />
<br />
The grade takes values 4, 5, …, 10. Grades 1, 2 and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:<br />
<br />
* '''score''' ≥ 35% => 4,<br />
* '''score''' ≥ 45% => 5,<br />
* ...<br />
* '''score''' ≥ 95% => 10,<br />
<br />
where '''score''' is calculated using the following rule:<br />
<br />
'''score''' = 0.7 * S<sub>cumulative</sub> + 0.3 * S<sub>exam2</sub> <br /><br />
'''cumulative score''' = 0.8 * S<sub>homework</sub> + 0.2 * S<sub>exam1</sub> + 0.2 * S<sub>competition</sub><br />
<br />
* S<sub>homework</sub> – proportion of correctly solved homework,<br />
* S<sub>exam1</sub> – proportion of successfully answered theoretical questions during exam after module 3,<br />
* S<sub>exam2</sub> – proportion of successfully answered theoretical questions during exam after module 4,<br />
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1).<br />
<br />
Participation in the machine learning competition is optional and can give students extra points. <br /><br />
"Automatic" passing of the course based on the '''cumulative score''' ''may'' be offered.<br />
<br />
== Plagiarism ==<br />
If plagiarism is discovered, both home assignments found to be identical will receive zero points. Repeated plagiarism by the same person will be reported to the dean.<br />
<br />
== Deadlines ==<br />
<br />
Assignments submitted after the late deadline will receive a zero score unless there is a legitimate reason for the late submission; a high workload in other classes is not a legitimate reason. <br />
<br />
== Structure of emails and homework submissions ==<br />
Practical assignments must be submitted as Jupyter notebooks, theoretical ones as PDF files. Practical assignments must use '''Python 2''' (or be Python 2 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.<br />
<br />
Assignments can be performed in either Russian or English.<br />
<br />
'''Assignments can be submitted only once!'''<br />
<br />
== Useful links ==<br />
=== Machine learning ===<br />
* [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github]<br />
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru]<br />
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning]<br />
* [https://www.dataoptimal.com/wp-content/uploads/Data-Science-Books-for-2018.pdf Some books for ML1]<br />
* [https://anvaka.github.io/greview/hands-on-ml/1/ Some books for ML2]<br />
* One of the classic ML books: [http://web.stanford.edu/~hastie/local.ftp/Springer/ESLII_print10.pdf Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)]<br />
<br />
=== Python ===<br />
* [http://python.org Official website]<br />
* Libraries: [http://www.numpy.org/ NumPy], [http://pandas.pydata.org/ Pandas], [http://scikit-learn.org/stable/ SciKit-Learn], [http://matplotlib.org/ Matplotlib].<br />
* A short example for beginners: [http://nbviewer.ipython.org/gist/voron13e02/83a86f2e0fc5e7f8424d A brief guide to Python 2 with examples]<br />
* Python from scratch: [http://nbviewer.ipython.org/gist/rpmuller/5920182 A Crash Course in Python for Scientists]<br />
* Lectures [https://github.com/jrjohansson/scientific-python-lectures#online-read-only-versions Scientific Python]<br />
* A book: [http://www.cin.ufpe.br/~embat/Python%20for%20Data%20Analysis.pdf Wes McKinney «Python for Data Analysis»]<br />
* [https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks A collection of interesting IPython notebooks]<br />
<br />
=== Python installation and configuration ===<br />
<br />
[https://www.continuum.io/downloads anaconda]</div>Mhushchyn