Data analysis (Software Engineering) 2020: differences between revisions
Mhushchyn (talk | contribs) |
(→Final Exam) |
||
(59 intermediate revisions by 5 users not shown) | |||
Line 1: | Line 1: | ||
− | '''[https://join.slack.com/t/hse-se-ml/shared_invite/ | + | '''[https://join.slack.com/t/hse-se-ml/shared_invite/enQtOTgxOTM5MDE5NTU5LTY5MGI2YWEwYWJkNmM5YmFhNDFkYjIwZjU0MTQyNDNmZDZkOTZmNTE4OGNhOGJlMzMwMDU0ZTc0YjRiYzQyMmY Slack Invite Link] <br /> |
'''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /> | '''Anonymous feedback form:''' [https://goo.gl/forms/xTfnM328m8ulT4FF2 here]<br /> | ||
'''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /> | '''[[ Data_analysis_(Software_Engineering)_2019 | Previous Course Page ]]''' <br /> | ||
'''[https://github.com/shestakoff/hse_se_ml Course repo]<br/> | '''[https://github.com/shestakoff/hse_se_ml Course repo]<br/> | ||
+ | '''[https://anytask.org/course/608 Anytask]<br/> | ||
+ | '''[https://docs.google.com/spreadsheets/d/16S7rSqnt4Al_IV8A5D0yARtx1Lc-Dl3N2uDDroa0FD0/edit?usp=sharing Scores]<br/> | ||
Line 20: | Line 22: | ||
# Midterm theoretical colloquium | # Midterm theoretical colloquium | ||
# Final exam | # Final exam | ||
+ | |||
+ | == Final Exam == | ||
+ | The final exam will be held on the '''8th of June'''. | ||
+ | |||
+ | Details about timing will be available soon | ||
+ | |||
+ | Questions list and rules are available [https://cloud.mail.ru/public/HSZw/WhDfGE1CM here]. | ||
+ | |||
+ | For complete instructions, please read the #general channel in Slack. | ||
+ | |||
+ | == Kaggle == | ||
+ | The link to the competition is in Slack. | ||
+ | |||
+ | Send reports before 23:59 on June 5 (the competition ends on the 4th of June; late submissions are not considered). <br/> | ||
+ | Reports should be submitted via the special [https://forms.gle/5QTo7ycrn1zGhg9m8 form]. | ||
+ | |||
+ | Try to follow the format of the report template: https://github.com/shestakoff/hse_se_ml/blob/master/2019/kaggle/kaggle-report-template.ipynb | ||
+ | |||
+ | == Colloquium == | ||
+ | The colloquium will be held on the '''7th of April''' during the seminar. | ||
+ | |||
+ | You may not use any materials during the colloquium except a single A4 sheet (both sides may be used), prepared beforehand and handwritten personally by you. You will get 2 questions from the [https://github.com/shestakoff/hse_se_ml/raw/master/2020/colloq/colloq-2020.pdf '''question list'''], with 15 minutes for preparation, and may receive additional questions or tasks. | ||
== Course Schedule (3rd module)== | == Course Schedule (3rd module)== | ||
Line 30: | Line 54: | ||
'''Lecture 1. Introduction to data science and machine learning ''' <br/> | '''Lecture 1. Introduction to data science and machine learning ''' <br/> | ||
[https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/> | [https://shestakoff.github.io/hse_se_ml/2020/l01-intro/lecture-intro.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 2. Metric-based methods. K-NN ''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l02-knn/lecture-knn.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 3. Decision Trees''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l03-trees/lecture-trees.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 4. Linear Regression''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l04-linreg/lecture-linreg.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 5. Linear Classification''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l05-linclass/lecture-linclass.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 6. Quality measures''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l06-metrics/lecture-metrics.slides#/ Slides], [https://youtu.be/kItcW-G0wzM record] <br/> | ||
+ | |||
+ | '''Lecture 7. Dimension reduction. PCA''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l07-dimred/lecture-dimred.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 8. NLP Introduction''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l08-nlp-intro/lecture-nlp-intro.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 9. Word embeddings''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l09-nlp-w2v/lecture-nlp-w2v.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 10. Ensembles. Random Forest''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l10-ensembles/lecture-ensemble.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 11. Ensembles. Boosting''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l11-boosting/lecture-boosting.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 12. Neural Networks 1''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l12-nn1/lecture-nn1.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 13. Neural Networks 2''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l13-nn2/lecture-nn2.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 14. Clustering''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l14-cluster/lecture-clust.slides#/ Slides] <br/> | ||
+ | |||
+ | '''Lecture 15. Recsys''' <br/> | ||
+ | [https://shestakoff.github.io/hse_se_ml/2020/l15-recsys/lecture-recsys.slides#/ Slides] <br/> | ||
== Seminars == | == Seminars == | ||
Line 35: | Line 101: | ||
'''Seminar 1. Introduction to Data Analysis in Python '''<br/> | '''Seminar 1. Introduction to Data Analysis in Python '''<br/> | ||
[https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/> | [https://github.com/shestakoff/hse_se_ml/tree/master/2020/s01-intro-to-python Practice in class] <br/> | ||
− | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb | + | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s01-intro-to-python/seminar1-homework.ipynb Homework 1] '''Due Date: 28.01.2020 23:59'''<br/> |
+ | |||
+ | '''Seminar 2. Metric-based methods. K-NN'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/tree/master/2020/s02-metric-based-methods%20 Practice in class] <br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/seminar2-homework.ipynb Homework 2] '''Due Date: 04.02.2020 23:59'''<br/> | ||
+ | |||
+ | '''Seminar 3. Decision Trees'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/tree/master/2020/s03-decision-trees Practice in class] <br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees//seminar3-homework.ipynb Homework 3] '''Due Date: 01.03.2020 23:59'''<br/> | ||
+ | |||
+ | '''Seminar 4. Linear Regression'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/tree/master/2020/s04-linear-regression Practice in class] <br/> | ||
+ | |||
+ | '''Seminar 5. Logistic Regression'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-logreg.ipynb Practice in class] <br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/seminar5-homework.ipynb Homework 4] '''Due Date: 22.03.2020 23:59'''<br/> | ||
+ | |||
+ | '''Seminar 6. Quality Measures'''<br/> | ||
+ | [https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/seminar6-quality.ipynb Practice in class] <br/> | ||
+ | |||
+ | '''Seminar 7. Dimension Reduction'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-dimred.ipynb Practice in class] <br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s07-dimred/seminar7-homework.ipynb Homework 5] '''Due Date: 12.04.2020 23:59'''<br/> | ||
+ | |||
+ | '''Seminar 8. Introduction to NLP'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/tree/master/2020/s08-nlp Practice in class] <br/> | ||
+ | [https://www.kaggle.com/c/explicit-content-detection Kaggle 1] '''Due Date: 28.04.2020 23:59'''<br/> | ||
+ | |||
+ | '''Seminar 9. Word2Vec'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/tree/master/2020/s09-word2vec Practice in class] <br/> | ||
+ | |||
+ | '''Seminar 10. Ensembles. Random Forest'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s10-ensembles/seminar10-ensembles.ipynb Practice in class] <br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s10-ensembles/seminar10-homework.ipynb Homework 6] '''Due Date: 12.05.2020 23:59'''<br/> | ||
+ | |||
+ | '''Seminar 11. Boosting'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s11-boosting/seminar.ipynb Practice in class] <br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s11-boosting/homework.ipynb Homework 7] '''Due Date: 19.05.2020 23:59'''<br/> | ||
+ | |||
+ | '''Seminar 12. NN-1'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s12-nn1/seminar12-nn1.ipynb Practice in class] <br/> | ||
+ | |||
+ | '''Seminar 13. NN-2'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s13-nn2/seminar13-nn2.ipynb Practice in class] <br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s13-nn2/seminar13-homework.ipynb Homework 8] '''Due Date: 27.05.2020 23:59'''<br/> | ||
+ | |||
+ | '''Seminar 14. Clustering'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/tree/master/2020/s14-clustering Practice in class] <br/> | ||
+ | |||
+ | '''Seminar 15. RecSys'''<br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/tree/master/2020/s15-recsys Practice in class] <br/> | ||
+ | |||
+ | == Theoretical questions for the colloquium == | ||
+ | |||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s02-metric-based-methods%20/knn_theory.pdf Metric-based methods. K-NN] <br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s03-decision-trees/trees_theory.pdf Decision Trees] <br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s04-linear-regression/linreg_theory.pdf Linear Regression] <br/> | ||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s05-logistic-regresion/linclass_theory.pdf Logistic Regression] <br/> | ||
+ | [https://github.com/matyushinleonid/hse_se_ml/blob/master/2020/s06-quality-measures/metrics_svm.pdf Quality Measures] <br/> | ||
+ | |||
+ | [https://github.com/shestakoff/hse_se_ml/blob/master/2020/s11-boosting/boosting-theory.pdf Boosting] <br/> | ||
== Evaluation criteria == | == Evaluation criteria == | ||
Line 59: | Line 185: | ||
Participation in the machine learning competition is optional and can give students extra points. <br \> | Participation in the machine learning competition is optional and can give students extra points. <br \> | ||
An "automatic" pass for the course based on the '''cumulative score''' ''may'' be granted. | An "automatic" pass for the course based on the '''cumulative score''' ''may'' be granted. | ||
+ | |||
+ | '''Kaggle competition 1'''<br/> | ||
+ | «Score» = ("your quality"-"baseline method quality") / ("max achieved quality" - "baseline method quality") <br \> | ||
+ | '''Required condition:''' a notebook with your best solution must be reproducible. Otherwise, you will not get any score. | ||
== Plagiarism == | == Plagiarism == | ||
Line 73: | Line 203: | ||
'''Assignments can be submitted only once!''' | '''Assignments can be submitted only once!''' | ||
+ | |||
+ | Link for the submissions: '''[https://anytask.org/course/608 Anytask.]<br/> | ||
== Useful links == | == Useful links == |
Current revision as of 19:35, 7 June 2020
Slack Invite Link
Anonymous feedback form: here
Previous Course Page
Course repo
Anytask
Scores
Course description
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We will also study the mathematical methods and concepts on which data analysis is based, as well as the formal assumptions behind them and various aspects of their implementation.
Significant attention is given to practical data analysis skills, which will be developed in seminars by studying the Python programming language and relevant libraries for scientific computing.
Knowledge of linear algebra, real analysis, and probability theory is required.
The class consists of:
- Lectures and seminars
- Practical and theoretical homework assignments
- A machine learning competition (more information will be available later)
- Midterm theoretical colloquium
- Final exam
Final Exam
The final exam will be held on the 8th of June.
Details about timing will be available soon.
The question list and rules are available here.
For complete instructions, please read the #general channel in Slack.
Kaggle
The link to the competition is in Slack.
Send reports before 23:59 on June 5 (the competition ends on the 4th of June; late submissions are not considered).
Reports should be submitted via the special form.
Try to follow the format of the report template: https://github.com/shestakoff/hse_se_ml/blob/master/2019/kaggle/kaggle-report-template.ipynb
Colloquium
The colloquium will be held on the 7th of April during the seminar.
You may not use any materials during the colloquium except a single A4 sheet (both sides may be used), prepared beforehand and handwritten personally by you. You will get 2 questions from the question list, with 15 minutes for preparation, and may receive additional questions or tasks.
Course Schedule (3rd module)
Lectures
Mondays
- 10:30-11:50, Room R205
Lecture materials
Lecture 1. Introduction to data science and machine learning
Slides
Lecture 2. Metric-based methods. K-NN
Slides
Lecture 3. Decision Trees
Slides
Lecture 4. Linear Regression
Slides
Lecture 5. Linear Classification
Slides
Lecture 6. Quality measures
Slides, record
Lecture 7. Dimension reduction. PCA
Slides
Lecture 8. NLP Introduction
Slides
Lecture 9. Word embeddings
Slides
Lecture 10. Ensembles. Random Forest
Slides
Lecture 11. Ensembles. Boosting
Slides
Lecture 12. Neural Networks 1
Slides
Lecture 13. Neural Networks 2
Slides
Lecture 14. Clustering
Slides
Lecture 15. Recsys
Slides
Seminars
Seminar 1. Introduction to Data Analysis in Python
Practice in class
Homework 1 Due Date: 28.01.2020 23:59
Seminar 2. Metric-based methods. K-NN
Practice in class
Homework 2 Due Date: 04.02.2020 23:59
Seminar 3. Decision Trees
Practice in class
Homework 3 Due Date: 01.03.2020 23:59
Seminar 4. Linear Regression
Practice in class
Seminar 5. Logistic Regression
Practice in class
Homework 4 Due Date: 22.03.2020 23:59
Seminar 6. Quality Measures
Practice in class
Seminar 7. Dimension Reduction
Practice in class
Homework 5 Due Date: 12.04.2020 23:59
Seminar 8. Introduction to NLP
Practice in class
Kaggle 1 Due Date: 28.04.2020 23:59
Seminar 9. Word2Vec
Practice in class
Seminar 10. Ensembles. Random Forest
Practice in class
Homework 6 Due Date: 12.05.2020 23:59
Seminar 11. Boosting
Practice in class
Homework 7 Due Date: 19.05.2020 23:59
Seminar 12. NN-1
Practice in class
Seminar 13. NN-2
Practice in class
Homework 8 Due Date: 27.05.2020 23:59
Seminar 14. Clustering
Practice in class
Seminar 15. RecSys
Practice in class
Theoretical questions for the colloquium
Metric-based methods. K-NN
Decision Trees
Linear Regression
Logistic Regression
Quality Measures
Evaluation criteria
The course runs during the 3rd and 4th modules. Students' knowledge is assessed through their home assignments and exams. There are two exams during the course, one after the 3rd module and one after the 4th module. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.
The grade takes values 4, 5, …, 10; grades 1, 2, and 3 are considered unsatisfactory. Exact grades are calculated using the following rule:
- score ≥ 35% => 4,
- score ≥ 45% => 5,
- ...
- score ≥ 95% => 10,
where the score is calculated using the following rule:
score = 0.7 * S_{cumulative} + 0.3 * S_{exam2}
S_{cumulative} = 0.8 * S_{homework} + 0.2 * S_{exam1} + 0.2 * S_{competition} (the cumulative score)
- S_{homework} – proportion of correctly solved homework,
- S_{exam1} – proportion of successfully answered theoretical questions during exam after module 3,
- S_{exam2} – proportion of successfully answered theoretical questions during exam after module 4,
- S_{competition} – score for the machine learning competition (also from 0 to 1).
Participation in the machine learning competition is optional and can give students extra points.
An "automatic" pass for the course based on the cumulative score may be granted.
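The grading rule above can be sketched in Python. This is only an illustration, not the official grading script: the function names are hypothetical, the intermediate thresholds (55%, 65%, 75%, 85%) are assumed to continue in 10-point steps between the two stated endpoints, and rounding the percentage before comparison is an added guard against floating-point noise.

```python
from typing import Optional


def cumulative_score(homework: float, exam1: float, competition: float = 0.0) -> float:
    """Cumulative score from proportions in [0, 1].

    The weights sum to 1.2 because the competition part is optional extra credit.
    """
    return 0.8 * homework + 0.2 * exam1 + 0.2 * competition


def final_score(cumulative: float, exam2: float) -> float:
    """Final score: 70% cumulative score, 30% second exam."""
    return 0.7 * cumulative + 0.3 * exam2


def grade(score: float) -> Optional[int]:
    """Map a score (as a proportion) to a grade 4..10.

    Returns None below 35%, where the grade is unsatisfactory (1, 2 or 3).
    Assumption: thresholds rise in 10-point steps from 35% (grade 4) to 95% (grade 10).
    """
    percent = round(score * 100, 6)  # guard against floating-point noise at thresholds
    if percent < 35:
        return None
    return min(10, 4 + int((percent - 35) // 10))


if __name__ == "__main__":
    cum = cumulative_score(homework=0.9, exam1=0.8)
    total = final_score(cum, exam2=0.7)
    print(cum, total, grade(total))
```

For example, with S_{homework} = 0.9, S_{exam1} = 0.8, no competition, and S_{exam2} = 0.7, the cumulative score is 0.88, the final score is 0.826, and the sketch returns grade 8 under the assumed thresholds.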
Kaggle competition 1
«Score» = ("your quality"-"baseline method quality") / ("max achieved quality" - "baseline method quality")
Required condition: a notebook with your best solution must be reproducible. Otherwise, you will not get any score.
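The competition score formula can be sketched in the same way. The function and parameter names are hypothetical, and clipping the result to the range from 0 to 1 is an assumption based on the statement that S_{competition} lies between 0 and 1; the page does not say how results below the baseline are handled.

```python
def competition_score(your_quality: float, baseline_quality: float, max_quality: float) -> float:
    """Normalize a result between the baseline and the best achieved quality.

    Assumes a higher quality value is better and that the best result
    beats the baseline, so the denominator is positive.
    """
    if max_quality <= baseline_quality:
        raise ValueError("max achieved quality must exceed the baseline quality")
    raw = (your_quality - baseline_quality) / (max_quality - baseline_quality)
    # Assumption: clip to [0, 1], since S_{competition} is described as ranging from 0 to 1.
    return max(0.0, min(1.0, raw))


if __name__ == "__main__":
    print(competition_score(your_quality=0.85, baseline_quality=0.70, max_quality=0.90))
```

A result equal to the baseline gives 0, and the best achieved result gives 1.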
Plagiarism
If plagiarism is discovered, zero points will be given for the home assignments, for both works found to be identical. In case of repeated plagiarism by the same person, a report will be made to the dean.
Deadlines
Assignments sent after late deadlines will not be scored (a zero score is assigned) unless there is a legitimate reason for late submission; a high workload in other classes is not a legitimate reason.
Structure of emails and homework submissions
Practical assignments must be implemented in Jupyter Notebook format, theoretical ones in pdf. Practical assignments must use Python 3 (or Python 3 compatible). Use your surname as a filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.
Assignments can be performed in either Russian or English.
Assignments can be submitted only once!
Link for the submissions: Anytask.
Useful links
Machine learning, Stats, Maths
- Machine learning course from Evgeny Sokolov on Github
- machinelearning.ru
- Video-lectures of K. Vorontsov on machine learning
- Some books for ML1
- Some books for ML2
- Math for ML
- One of classic ML books. Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)
- Linear Algebra Immersive book
Python
- Official website
- Libraries: NumPy, Pandas, SciKit-Learn, Matplotlib.
- A short example for beginners: a brief guide with examples for Python 2
- Python from scratch: A Crash Course in Python for Scientists
- Lectures Scientific Python
- A book: Wes McKinney «Python for Data Analysis»
- A collection of interesting IPython notebooks