Data analysis (Software Engineering) 2017
Class email: cshse.ml@gmail.com
Anonymous feedback form: here
Scores: here
Course description
In this class we consider the main problems of data mining and machine learning: classification, clustering, regression, dimensionality reduction, ranking, collaborative filtering. We will also study mathematical methods and concepts which data analysis is based on as well as formal assumptions behind them and various aspects of their implementation.
Significant attention is given to practical skills of data analysis, which will be developed in seminars by studying the Python programming language and relevant libraries for scientific computing.
Knowledge of linear algebra, real analysis, and probability theory is required.
The class consists of:
- Lectures and seminars
- Practical and theoretical homework assignments
- A machine learning competition (more information will be available later)
- Midterm theoretical colloquium
- Final exam
Kaggle competition
Full data and rules will be available on the Kaggle platform in several days. Link to train set.
Colloquium
The colloquium will be held on April 7th during the lecture & seminar time slot.
You may not use any materials during the colloquium except a single A4 sheet prepared before the exam and handwritten personally by you (on both sides). You will receive 2 questions from the question list, with 25 minutes for preparation, and may be asked additional questions or tasks.
Syllabus
- Introduction to machine learning.
- K-nearest neighbours classification and regression. Extensions. Optimization techniques.
- Decision tree methods.
- Bayesian decision theory. Model evaluation.
- Linear classification methods. Adding regularization to linear methods.
- Regression.
- Kernel generalization of standard methods.
- Neural networks.
- Ensemble methods: bagging, boosting, etc.
- Feature selection.
- Feature extraction.
- EM algorithm. Density estimation using mixtures.
- Clustering.
- Collaborative filtering.
- Ranking.
Lecture materials
Lecture 1. Introduction to data science and machine learning. Download
Lecture 2. Metric methods of classification & regression. Download
Lecture 3. Decision trees. Download
Lecture 4. Regression methods. Download
Lecture 5. Properties of convex functions. Download
Lecture 6. Linear methods of classification. Download
Lecture 7. Classifier evaluation. Download
Lecture 8. SVM and kernel trick. Download
Lecture 9. Principal component analysis. Download
Lecture 10. Singular value decomposition. Download
Lecture 11. Feature selection. Download
Lecture 12. Working with text. Download
Seminars
Seminar 1. Introduction to Data Analysis in Python
Practical task 1, data. Deadline: January 19.
Seminar 2. Metric Classifiers
Theoretical task 2, Deadline: January 26
Practical task 2, data. Deadline: February 2
Seminar 3. Decision trees
Theoretical task 3, Deadline: February 2
Seminar 4. Regression methods
Theoretical task 4, Deadline: February 9
Seminar 5. Linear classification: loss functions
Theoretical task 5, Deadline: February 16
Seminar 6. Linear classification: optimization
Theoretical task 6, Deadline: March 2
Practical task 6, first dataset, diabetes dataset. Deadline: March 16
Seminar 7. Classifier evaluation
Theoretical task 7, Deadline: March 16
Seminar 8. SVM and kernel trick
Theoretical task 8, Deadline: March 23
Practical task 8, data. Deadline: March 30
Seminar 9. PCA
Theoretical task 9, Deadline: April 20
Seminar 10. Feature selection + text mining
Theoretical task 10, Deadline: April 27
Evaluation criteria
The course runs during the 3rd and 4th modules. Students' knowledge is assessed through their home assignments and exams. Home assignments are divided into theoretical tasks and practical tasks. There are two exams during the course: one after the 3rd module and one after the 4th module. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.
The grade takes values 4, 5, …, 10; grades 1, 2, and 3 are considered unsatisfactory. The exact grade is calculated using the following rule:
- score ≥ 35% => 4,
- score ≥ 45% => 5,
- ...
- score ≥ 95% => 10,
where score is calculated using the following rule:
score = 0.6 * S_homework + 0.2 * S_exam1 + 0.2 * S_exam2 + 0.2 * S_competition
- S_homework – proportion of correctly solved homework,
- S_exam1 – proportion of successfully answered theoretical questions during the exam after module 3,
- S_exam2 – proportion of successfully answered theoretical questions during the exam after module 4,
- S_competition – score for the machine learning competition (also ranges from 0 to 1).
If you solve a theoretical problem in class, you obtain 1.5 points (if you solve it at home, you obtain 1 point). Participation in the machine learning competition is optional and can give students extra points. A short worked example of the grading rule is given below.
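As a worked example, here is a minimal Python sketch of the grading rule above (not official grading code). It assumes the grade thresholds rise in steps of 10 percentage points from 35% (grade 4) to 95% (grade 10), which is what the listed endpoints suggest; the weights are copied verbatim from the score formula.

def final_score(s_homework, s_exam1, s_exam2, s_competition=0.0):
    """All inputs are proportions in [0, 1]; the competition term is optional extra credit."""
    return 0.6 * s_homework + 0.2 * s_exam1 + 0.2 * s_exam2 + 0.2 * s_competition

def final_grade(score):
    """Map a score to the 10-point scale; below 35% falls into the unsatisfactory band (1-3)."""
    grade = 3  # placeholder for the unsatisfactory band; the exact value is not specified on this page
    for g, threshold in enumerate(range(35, 96, 10), start=4):  # 35% -> 4, 45% -> 5, ..., 95% -> 10
        if score >= threshold / 100:
            grade = g
    return grade

# Example: strong homework, decent exams, no competition participation.
print(final_grade(final_score(0.9, 0.7, 0.8)))  # score = 0.84 -> grade 8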
Plagiarism
If plagiarism is discovered, zero points will be given for the home assignment to both works that were found to be identical. In case of repeated plagiarism by the same person, a report will be made to the dean.
Deadlines
The standard period for working on a homework assignment is 2 weeks for practical assignments and 1 week for theoretical ones; the first practical assignment is an exception. Assignments submitted after the late deadline will not be scored (they receive zero) unless there is a legitimate reason for late submission; a high workload in other classes does not count as a legitimate reason.
Deadline time: 23:59 on the day before the seminar (Thursday).
Structure of emails and homework submissions
All the questions and submissions must be addressed to cshse.ml@gmail.com. The following subjects must be used:
- For questions (general, regarding assignments, etc): "Question - Surname Name - Group"
- For homework submissions: "Practice/Theory {Lab number} - Surname Name - Group"
Example: Practice 1 - Ivanov Ivan - 141
Please do not mix two different topics, such as theoretical and practical assignments, in a single email. When replying, please use the same thread (i.e. reply to the same email).
Practical assignments must be implemented in Jupyter notebook format, theoretical ones in PDF. Practical assignments must use Python 3. Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.
Assignments can be performed in either Russian or English.
Assignments can be submitted only once!
Useful links
Machine learning
- Machine learning course from Evgeny Sokolov on Github
- machinelearning.ru
- Video-lectures of K. Vorontsov on machine learning
- One of the classic ML books: The Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman)
Python
- Official website
- Libraries: NumPy, Pandas, SciKit-Learn, Matplotlib.
- A small example for beginners: a brief guide with Python 2 examples
- Python from scratch: A Crash Course in Python for Scientists
- Lectures on Scientific Python
- A book: Wes McKinney «Python for Data Analysis»
- A collection of interesting IPython notebooks