Data analysis (Software Engineering) 2017: difference between revisions
Arbabenko (talk | contribs) |
Apogentus (talk | contribs) |
||
(69 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | |||
− | |||
'''Class email:''' cshse.ml@gmail.com<br /> | '''Class email:''' cshse.ml@gmail.com<br /> | ||
'''Anonymous feedback form:''' [https://docs.google.com/forms/d/e/1FAIpQLSdxP-U47SLedjTvF0CyxHSIYy8eTUnzcDOc9DIl4gFSD2-ixA/viewform here]<br /> | '''Anonymous feedback form:''' [https://docs.google.com/forms/d/e/1FAIpQLSdxP-U47SLedjTvF0CyxHSIYy8eTUnzcDOc9DIl4gFSD2-ixA/viewform here]<br /> | ||
Line 6: | Line 4: | ||
<br /> | <br /> | ||
+ | |||
+ | == Exam questions == | ||
+ | |||
+ | [https://yadi.sk/i/2w5_Ts2p3K7jQb Exam questions] | ||
== Course description == | == Course description == | ||
Line 21: | Line 23: | ||
# Final exam | # Final exam | ||
− | == | + | == Kaggle competition == |
+ | Follow this link to [https://kaggle.com/join/HSE_Competition participate]. | ||
− | + | Baseline loss: 0.89 | |
− | + | ||
− | + | [https://yadi.sk/d/vfVkVE__3JSbMT Baseline solution.] | |
− | + | ||
− | + | [https://docs.google.com/spreadsheets/d/1gfmWjrigiVbIjBv9G40q3n_OHHaeRg6gf0o2vqFxGzk/edit?usp=sharing Evaluation results] | |
− | + | ||
− | + | == Colloquium == | |
− | + | Colloquium will be held on April 7th during lecture & seminars time slot. | |
− | + | ||
− | + | You may not use any materials during colloquium except single A4 prepared before the exam and handwritten personally by you (from two sides). You will have 2 questions from the [https://yadi.sk/i/myp4_88l3GJDwe questions list] with 25 minutes for preparation and may receive additional questions or tasks. | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
== Lecture materials == | == Lecture materials == | ||
Line 46: | Line 44: | ||
'''Lecture 2. Metric methods of classification & regression.''' | '''Lecture 2. Metric methods of classification & regression.''' | ||
[https://yadi.sk/i/D4YvGVx739oAZ8 Download] | [https://yadi.sk/i/D4YvGVx739oAZ8 Download] | ||
+ | |||
+ | '''Lecture 3. Decision trees.''' | ||
+ | [https://yadi.sk/i/sUJamzb83Ao4xL Download] | ||
+ | |||
+ | '''Lecture 4. Regression methods.''' | ||
+ | [https://yadi.sk/i/mvozzv_s3CReQQ Download] | ||
+ | |||
+ | '''Lecture 5. Properties of convex functions.''' | ||
+ | [https://yadi.sk/i/zCcyHHCz3DhLTx Download] | ||
+ | |||
+ | '''Lecture 6. Linear methods of classification.''' | ||
+ | [https://yadi.sk/i/CsWN0nwW3DhDJp Download] | ||
+ | |||
+ | '''Lecture 7. Classifier evaluation.''' | ||
+ | [https://yadi.sk/i/r3tYJwZ73FNNWk Download] | ||
+ | |||
+ | '''Lecture 8. SVM and kernel trick.''' | ||
+ | [https://yadi.sk/i/f35uPeOe3FzvKg Download] | ||
+ | |||
+ | '''Lecture 9. Principal component analysis.''' | ||
+ | [https://yadi.sk/i/dfImvp3O3Gwoaw Download] | ||
+ | |||
+ | '''Lecture 10. Singular value decomposition.''' | ||
+ | [https://yadi.sk/i/znlevSwu3Gwobb Download] | ||
+ | |||
+ | '''Lecture 11. Feature selection.''' | ||
+ | [https://yadi.sk/i/lyx3mLCc3HBwML Download] | ||
+ | |||
+ | '''Lecture 12. Working with text.''' | ||
+ | [https://yadi.sk/i/NmoN6-I43HBwN5 Download] | ||
+ | |||
+ | '''Lecture 13. Ensemble methods.''' | ||
+ | [https://yadi.sk/i/5K8fg85Q3HRbEF Download] | ||
+ | |||
+ | '''Lecture 14. Boosting.''' | ||
+ | [https://yadi.sk/i/IT2V7Dfe3J4Qxh Download] | ||
+ | |||
+ | '''Lecture 15. Neural networks.''' | ||
+ | [https://yadi.sk/i/ZOPEDMW03JJAaj Download] | ||
+ | |||
+ | '''Lecture 16. Clustering.''' | ||
+ | [https://yadi.sk/i/Cb3IlxK83JXGZq Download] | ||
+ | |||
+ | '''Lecture 17. Mixture density models.''' | ||
+ | [https://yadi.sk/i/1h3fvfmn3JXH5R Download] | ||
+ | |||
+ | '''Lecture 18. Clustering evaluation.''' | ||
+ | [https://yadi.sk/i/HPt4xzPP3JXGo6 Download] | ||
== Seminars == | == Seminars == | ||
Line 57: | Line 103: | ||
'''Seminar 2. Metric Classifiers ''' | '''Seminar 2. Metric Classifiers ''' | ||
− | [https://drive.google.com/file/d/0B7TWwiIrcJstVmJ5NnNBY3YwV2c/view?usp=sharing | + | [https://drive.google.com/file/d/0B7TWwiIrcJstVmJ5NnNBY3YwV2c/view?usp=sharing Theoretical task 2], Deadline: January 26 |
+ | |||
+ | [https://drive.google.com/file/d/0B7TWwiIrcJstUTg0czdlVkpMaWc/view?usp=sharing Practical task 2], [https://drive.google.com/open?id=0B7TWwiIrcJstSEZOZzBoQUo3bk0 data]. Deadline: February 2 | ||
+ | |||
+ | '''Seminar 3. Decision trees ''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B7TWwiIrcJstelVRUVBXNktKenc/view?usp=sharing Theoretical task 3], Deadline: February 2 | ||
+ | |||
+ | '''Seminar 4. Regression methods ''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B7TWwiIrcJsta29CeXdfcXFwWFU/view?usp=sharing Theoretical task 4], Deadline: February 9 | ||
+ | |||
+ | '''Seminar 5. Linear classification: loss functions ''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B7TWwiIrcJstUzQxWndJaTB1cGM/view?usp=sharing Theoretical task 5], Deadline: February 16 | ||
+ | |||
+ | '''Seminar 6. Linear classification: optimization ''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B7TWwiIrcJstaHFVRFV2Z0F0ZWs/view?usp=sharing Theoretical task 6], Deadline: March 2 | ||
+ | |||
+ | [https://drive.google.com/file/d/0B7TWwiIrcJstRjNSbXhxWDZfRzA/view?usp=sharing Practical task 6], [https://drive.google.com/file/d/0B7TWwiIrcJstbGZZTDNjazg5Mmc/view?usp=sharing first dataset], [https://drive.google.com/file/d/0B7TWwiIrcJstZ1M3NmhXU0EwSlE/view?usp=sharing diabetes dataset]. Deadline: March 16 | ||
+ | |||
+ | '''Seminar 7. Classifier evaluation''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B4DmUfeAdxyZUF9FMlVWYUVmQUE/view?usp=sharing Theoretical task 7], Deadline: March 16 | ||
+ | |||
+ | '''Seminar 8. SVM and kernel trick''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B7TWwiIrcJstazFWSXVSREJzSGc/view?usp=sharing Theoretical task 8], Deadline: March 23 | ||
+ | |||
+ | [https://drive.google.com/file/d/0B7TWwiIrcJstOVotNEhnYUJYdGs/view?usp=sharing Practical task 8], [https://drive.google.com/file/d/0B7TWwiIrcJstU0gxeDhYX1hHN1E/view?usp=sharing data]. Deadline: March 30 | ||
+ | |||
+ | '''Seminar 9. PCA''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B4DmUfeAdxyZT1U0aUhjQmphZDA/view?usp=sharing Theoretical task 9], Deadline: April 20 | ||
+ | |||
+ | '''Seminar 10. Feature selection + text mining''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B4DmUfeAdxyZdHk1MmFxUlVwNm8/view?usp=sharing Theoretical task 10], Deadline: April 27 | ||
+ | |||
+ | '''Seminar 11. Ensembles, bagging''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B4DmUfeAdxyZVjk1c2dIWmVSUkk/view?usp=sharing Practical task 11], Deadline: May 18 | ||
+ | |||
+ | '''Seminar 12. Ensembles, boosting''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B4DmUfeAdxyZTFJGZ2VmcURVZDg/view?usp=sharing Theoretical task 12], Deadline: May 18 | ||
+ | |||
+ | '''Seminar 13. Neural networks''' | ||
+ | |||
+ | [https://drive.google.com/file/d/0B7TWwiIrcJstTnJTd2d0QldneDA/view?usp=sharing Practical task 13], [https://drive.google.com/open?id=0B7TWwiIrcJstMEU3bXkxY1hwZ00 Data], [https://drive.google.com/open?id=0B7TWwiIrcJstcjB4eFJaUkxDdEE Short training set], Deadline: June 6 | ||
+ | |||
+ | Additional materials: [https://drive.google.com/open?id=0B7TWwiIrcJstbHFYbFcwTkNpNWc Backpropagation], [http://pybrain.org/docs/ PyBrain’s documentation], [https://drive.google.com/open?id=0B7TWwiIrcJstOTZidU1BREtSRjA PyBrain example from the seminar] | ||
+ | |||
+ | ''By default, you should use the whole training set from Data. But if you have MemoryError then use Short training set. '' | ||
+ | |||
+ | '''Seminar 14. Clustering''' | ||
− | [https://drive.google.com/file/d/ | + | [https://drive.google.com/file/d/0B7TWwiIrcJstSjRBMlNLSW00YlU/view?usp=sharing Theoretical task 14], Deadline: June 1 |
== Evaluation criteria == | == Evaluation criteria == | ||
Line 80: | Line 182: | ||
* S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1). | * S<sub>competition</sub> – score for the competition in machine learning (it's also from 0 to 1). | ||
+ | If you solve the theoretical problem in class you obtain 1.5 points (if you solve it at home you obtain 1 point). | ||
Participation in machine learning competition is optional and can give students extra points. | Participation in machine learning competition is optional and can give students extra points. | ||
Line 109: | Line 212: | ||
== Useful links == | == Useful links == | ||
=== Machine learning === | === Machine learning === | ||
+ | * [https://github.com/esokolov/ml-course-hse Machine learning course from Evgeny Sokolov on Github] | ||
* [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru] | * [http://www.machinelearning.ru/wiki/index.php?title=Заглавная_страница machinelearning.ru] | ||
* [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning] | * [https://yandexdataschool.ru/edu-process/courses/machine-learning Video-lectures of K. Vorontsov on machine learning] |
Current revision as of 19:44, 9 December 2017
Class email: cshse.ml@gmail.com
Anonymous feedback form: here
Scores: here
Exam questions
Course description
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, and collaborative filtering. We will also study the mathematical methods and concepts on which data analysis is based, the formal assumptions behind them, and various aspects of their implementation.
Significant attention is given to practical data analysis skills, which will be developed in seminars by studying the Python programming language and relevant libraries for scientific computing.
Knowledge of linear algebra, real analysis, and probability theory is required.
The class consists of:
- Lectures and seminars
- Practical and theoretical homework assignments
- A machine learning competition (more information will be available later)
- Midterm theoretical colloquium
- Final exam
Kaggle competition
Follow this link to participate.
Baseline loss: 0.89
Colloquium
The colloquium will be held on April 7th during the lecture and seminar time slot.
You may not use any materials during the colloquium except a single A4 sheet, prepared in advance and handwritten personally by you (both sides may be used). You will get 2 questions from the questions list, with 25 minutes for preparation, and may receive additional questions or tasks.
Lecture materials
Lecture 1. Introduction to data science and machine learning. Download
Lecture 2. Metric methods of classification & regression. Download
Lecture 3. Decision trees. Download
Lecture 4. Regression methods. Download
Lecture 5. Properties of convex functions. Download
Lecture 6. Linear methods of classification. Download
Lecture 7. Classifier evaluation. Download
Lecture 8. SVM and kernel trick. Download
Lecture 9. Principal component analysis. Download
Lecture 10. Singular value decomposition. Download
Lecture 11. Feature selection. Download
Lecture 12. Working with text. Download
Lecture 13. Ensemble methods. Download
Lecture 14. Boosting. Download
Lecture 15. Neural networks. Download
Lecture 16. Clustering. Download
Lecture 17. Mixture density models. Download
Lecture 18. Clustering evaluation. Download
Seminars
Seminar 1. Introduction to Data Analysis in Python
Practical task 1, data. Deadline: January 19.
Seminar 2. Metric Classifiers
Theoretical task 2, Deadline: January 26
Practical task 2, data. Deadline: February 2
Seminar 3. Decision trees
Theoretical task 3, Deadline: February 2
Seminar 4. Regression methods
Theoretical task 4, Deadline: February 9
Seminar 5. Linear classification: loss functions
Theoretical task 5, Deadline: February 16
Seminar 6. Linear classification: optimization
Theoretical task 6, Deadline: March 2
Practical task 6, first dataset, diabetes dataset. Deadline: March 16
Seminar 7. Classifier evaluation
Theoretical task 7, Deadline: March 16
Seminar 8. SVM and kernel trick
Theoretical task 8, Deadline: March 23
Practical task 8, data. Deadline: March 30
Seminar 9. PCA
Theoretical task 9, Deadline: April 20
Seminar 10. Feature selection + text mining
Theoretical task 10, Deadline: April 27
Seminar 11. Ensembles, bagging
Practical task 11, Deadline: May 18
Seminar 12. Ensembles, boosting
Theoretical task 12, Deadline: May 18
Seminar 13. Neural networks
Practical task 13, Data, Short training set, Deadline: June 6
Additional materials: Backpropagation, PyBrain’s documentation, PyBrain example from the seminar
By default, use the whole training set from Data. If you get a MemoryError, use the Short training set instead.
Seminar 14. Clustering
Theoretical task 14, Deadline: June 1
Evaluation criteria
The course runs during the 3rd and 4th modules. Students' knowledge is assessed through their home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams during the course, one after the 3rd module and one after the 4th module. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.
The grade takes values 4, 5, …, 10. Grades 1, 2, and 3 are considered unsatisfactory. Exact grades are calculated using the following rule:
- score ≥ 35% => 4,
- score ≥ 45% => 5,
- ...
- score ≥ 95% => 10,
where score is calculated using the following rule:
score = 0.6 * S_homework + 0.2 * S_exam1 + 0.2 * S_exam2 + 0.2 * S_competition
- S_homework – proportion of correctly solved homework,
- S_exam1 – proportion of successfully answered theoretical questions during exam after module 3,
- S_exam2 – proportion of successfully answered theoretical questions during exam after module 4,
- S_competition – score for the competition in machine learning (it's also from 0 to 1).
A theoretical problem solved in class earns 1.5 points; the same problem solved at home earns 1 point. Participation in the machine learning competition is optional and can give students extra points.
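As a rough illustration of the grading rule above, the following Python sketch computes the score and maps it to a grade. The function names are hypothetical. The intermediate thresholds (55%, 65%, 75%, 85%) are an assumption: the list above elides them, and the sketch simply supposes that they continue in steps of 10 percentage points. Returning None for a score below 35% is also an illustrative choice, not something the course page specifies.

```python
def total_score(homework, exam1, exam2, competition=0.0):
    """Weighted score from the formula above; every input is a proportion in [0, 1].

    The competition is optional, so it defaults to 0 and can only add points.
    """
    return 0.6 * homework + 0.2 * exam1 + 0.2 * exam2 + 0.2 * competition


def grade_from_score(score):
    """Map a score to a grade from 4 to 10, or None if it is below 35%.

    ASSUMPTION: thresholds rise in steps of 10 percentage points
    (35% -> 4, 45% -> 5, ..., 95% -> 10). Only the 35%, 45% and 95%
    thresholds are stated explicitly in the rule above.
    """
    for grade in range(10, 3, -1):           # check 10 first, then 9, ..., 4
        threshold = (10 * grade - 5) / 100   # 10 -> 0.95, 4 -> 0.35
        if score >= threshold:
            return grade
    return None  # unsatisfactory; the page gives no mapping to grades 1-3


if __name__ == "__main__":
    s = total_score(homework=0.5, exam1=0.5, exam2=0.5)
    print(s, grade_from_score(s))
```

Because the competition term is added on top of weights that already sum to 1, the score can exceed 100%; the sketch treats any such score as grade 10. Scores that land exactly on a threshold may be affected by floating-point rounding, so real grading should not rely on this sketch for borderline cases.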
Plagiarism
If plagiarism is discovered, both of the assignments found to be identical receive zero points. If the same person plagiarizes again, a report will be made to the dean.
Deadlines
The standard period for working on a homework assignment is 2 weeks for practical assignments and 1 week for theoretical ones. The first practical assignment is an exception. Assignments sent after the deadline receive a zero score unless there is a legitimate reason for the late submission; a heavy workload in other classes does not count as one.
Deadline time: 23:59 on the day before the seminar (Thursday).
Structure of emails and homework submissions
All questions and submissions must be sent to cshse.ml@gmail.com. The following subject lines must be used:
- For questions (general, regarding assignments, etc): "Question - Surname Name - Group"
- For homework submissions: "Practice/Theory {Lab number} - Surname Name - Group"
Example: Practice 1 - Ivanov Ivan - 141
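A small helper can make the subject format harder to get wrong. The sketch below is only an illustration of the pattern described above; the function name and its parameters are hypothetical and not part of the course requirements.

```python
def submission_subject(kind, number, surname, name, group):
    """Build a subject line of the form 'Practice 1 - Ivanov Ivan - 141'.

    `kind` must be 'Practice' or 'Theory', matching the format above.
    """
    if kind not in ("Practice", "Theory"):
        raise ValueError("kind must be 'Practice' or 'Theory'")
    return f"{kind} {number} - {surname} {name} - {group}"


def question_subject(surname, name, group):
    """Build a subject line for a question, e.g. 'Question - Ivanov Ivan - 141'."""
    return f"Question - {surname} {name} - {group}"


if __name__ == "__main__":
    print(submission_subject("Practice", 1, "Ivanov", "Ivan", 141))
```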
Please do not mix different topics, such as theoretical and practical assignments, in a single email. When replying, please use the same thread (i.e. reply to the same email).
Practical assignments must be submitted as Jupyter notebooks and theoretical ones as PDF files. Practical assignments must use Python 3. Use your surname as the filename (e.g. Ivanov.ipynb). Do not archive your assignments.
Assignments can be written in either Russian or English.
Assignments can be submitted only once!
Useful links
Machine learning
- Machine learning course from Evgeny Sokolov on Github
- machinelearning.ru
- Video-lectures of K. Vorontsov on machine learning
- One of the classic ML books: Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)
Python
- Official website
- Libraries: NumPy, Pandas, SciKit-Learn, Matplotlib.
- A short example for beginners: a brief guide with examples for Python 2
- Python from scratch: A Crash Course in Python for Scientists
- Lectures Scientific Python
- A book: Wes McKinney «Python for Data Analysis»
- A collection of interesting IPython notebooks