Data analysis (Software Engineering) 2019
- 1 Course description
- 2 Kaggle
- 3 Colloquium
- 4 Course Schedule (4th module)
- 5 Lecture materials
- 6 Seminars
- 7 Evaluation criteria
- 8 Plagiarism
- 9 Deadlines
- 10 Structure of emails and homework submissions
- 11 Useful links
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, and collaborative filtering. We will also study the mathematical methods and concepts on which data analysis is based, as well as the formal assumptions behind them and various aspects of their implementation.
Significant attention is given to the practical skills of data analysis, which will be developed in seminars by studying the Python programming language and the relevant libraries for scientific computing.
Knowledge of linear algebra, real analysis, and probability theory is required.
The class consists of:
- Lectures and seminars
- Practical and theoretical homework assignments
- A machine learning competition (more information will be available later)
- Midterm theoretical colloquium
- Final exam
The link to the competition is in Slack.
You should send reports before June 14, 23:59 (the competition ends on June 13). Reports should be sent here. Try to follow the format of the report template: https://github.com/shestakoff/hse_se_ml/blob/master/2019/kaggle/kaggle-report-template.ipynb
The colloquium will be held on the 1st and 2nd of April during the seminars and the lecture.
You may not use any materials during the colloquium, except a single A4 sheet prepared before the exam and handwritten personally by you (on both sides). You will get 2 questions from the question list with 20 minutes for preparation, and you may receive additional questions or tasks.
We have strict time limits, so come to your own seminar or an earlier one.
Course Schedule (4th module)
Dates: Mondays (01.04, 08.04, 15.04, 22.04, 13.05, 20.05, 27.05, 03.06, 10.06)
- Group BPI-161, 9:00-10:30, Room 501
- Group BPI-162, 10:30-11:50, Room 311
- Group BPI-163, 12:10-13:30, Room 311
Dates: Tuesdays (02.04, 09.04, 16.04, 23.04, 14.05, 21.05, 28.05, 04.06)
- 9:00-10:20, Room 317
- 04.06: Room 402
Lecture 1. Introduction to data science and machine learning
Lecture 2. Cross-validation. Metric-based models. KNN
Lecture 3. Decision Trees
Lecture 4. Linear Regression, Gradient-based optimization
Lecture 5. Regularization, Linear Classification
Lecture 6. Supervised Quality Measures
Lecture 7. Support Vector Machines. Kernel Trick
Lecture 8. Feature Selection. Dimension Reduction. PCA
Lecture 9. Ensembles
Lecture 10. Boosting
Lecture 11. Neural Networks 1
Lecture 12. Neural Networks 2
Lecture 13. Introduction to NLP
Lecture 14. Clustering
Seminar 13. Intro to Kaggle and NLP
Practice in class
To ease the examination process for our course assistants, please put your subgroup number at the beginning of solution filenames.
The course lasts through the 3rd and 4th modules. Students' knowledge is assessed by evaluating their home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams during the course: after the 3rd module and after the 4th module, respectively. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.
The grade takes values 4, 5, …, 10. Grades corresponding to 1, 2, 3 are considered unsatisfactory. The exact grade is calculated using the following rule:
- score ≥ 35% => 4,
- score ≥ 45% => 5,
- score ≥ 95% => 10,
where score is calculated using the following rule:
score = 0.7 * Scumulative + 0.3 * Sexam2
Scumulative = 0.8 * Shomework + 0.2 * Sexam1 + 0.2 * Scompetition
- Shomework – proportion of correctly solved homework,
- Sexam1 – proportion of successfully answered theoretical questions during exam after module 3,
- Sexam2 – proportion of successfully answered theoretical questions during exam after module 4,
- Scompetition – score for the competition in machine learning (it's also from 0 to 1).
Participation in the machine learning competition is optional and can give students extra points.
Automatic passing of the course based on the cumulative score may be granted.
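As a sketch, the scoring rule above can be written in Python (function names are illustrative; the thresholds for grades 6-9 are not listed in the source, so only the score formulas and the published cutoffs are encoded):

```python
def cumulative_score(homework, exam1, competition=0.0):
    """All inputs are proportions in [0, 1]; the competition term is optional extra credit."""
    return 0.8 * homework + 0.2 * exam1 + 0.2 * competition

def final_score(cumulative, exam2):
    # score = 0.7 * Scumulative + 0.3 * Sexam2
    return 0.7 * cumulative + 0.3 * exam2

# Example: strong homework, decent exams, no competition entry.
s = final_score(cumulative_score(homework=0.9, exam1=0.8), exam2=0.7)
print(round(s, 3))  # 0.7 * 0.88 + 0.3 * 0.7 = 0.826
```

Note that with a perfect competition score the cumulative weights sum to 1.2, which is consistent with the competition being optional extra points.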
In case of discovered plagiarism, zero points will be given for the home assignments, for both works found to be identical. In case of repeated plagiarism by the same person, a report will be made to the dean's office.
Assignments sent after the late deadlines will not be scored (they receive zero points) in the absence of legitimate reasons for late submission; a high load in other classes is not a legitimate reason.
Structure of emails and homework submissions
Practical assignments must be implemented in Jupyter Notebook format, theoretical ones in PDF. Practical assignments must use Python 3 (or be Python 3 compatible). Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.
Assignments can be performed in either Russian or English.
Assignments can be submitted only once!
Machine learning, Stats, Maths
- Machine learning course from Evgeny Sokolov on Github
- Video-lectures of K. Vorontsov on machine learning
- Some books for ML1
- Some books for ML2
- Math for ML
- One of the classic ML books: The Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)
- Linear Algebra Immersive book
- Official website
- Libraries: NumPy, Pandas, SciKit-Learn, Matplotlib.
- A small example for beginners: a short guide with Python 2 examples
- Python from scratch: A Crash Course in Python for Scientists
- Lectures Scientific Python
- A book: Wes McKinney «Python for Data Analysis»
- A collection of interesting IPython notebooks