Data analysis (Software Engineering)

From Wiki - Faculty of Computer Science

Scores and deadlines: here

Class email: cshse.ml@gmail.com

Anonymous feedback form: here

Announcements

Course description

In this class we consider the main problems of machine learning and data analysis: classification, regression, dimensionality reduction, clustering, and collaborative filtering. We will also study the mathematical methods and concepts that machine learning is based on, the formal assumptions behind them, and various aspects of their implementation.

Significant attention is given to practical data analysis skills, which are developed in seminars by working with the Python programming language and the relevant libraries for scientific computing.

Prerequisites

Firm knowledge of linear algebra, mathematical analysis and probability theory is required.

The class consists of:

  1. Lectures and seminars
  2. Practical and theoretical homework assignments
  3. A machine learning competition (more information will be available later)
  4. Midterm theoretical colloquium
  5. Final exam

Syllabus

  1. Introduction. Core concepts of machine learning.
  2. K-nearest neighbours method.
  3. Decision tree methods.
  4. Regression methods. Regularization.
  5. Convex functions. Classification with linear methods. Loss functions.
  6. Classification with linear methods. Gradient descent and stochastic gradient descent. Regularization.
  7. Model evaluation. Logistic regression.
  8. Support vector machines.
  9. Generalization with kernels.
  10. Linear dimensionality reduction - PCA, SVD.
  11. Feature selection.
  12. Ensemble methods.
  13. Boosting. XGBoost.
  14. Neural networks - architecture.
  15. Neural networks - optimization.
  16. Clustering. EM algorithm for Gaussian mixtures.
  17. Collaborative filtering.
  18. Bayes decision theory. Naive Bayes assumption. Kernel density estimation.
  19. Semi-supervised learning.
  20. Active learning.

Lecture materials

Lecture 1. Introduction to data science and machine learning.

Download

Seminars

Seminar 1. Introduction to Data Analysis in Python

Practical task 1, data

Additional materials: 1, 2

Seminar 2. kNN

Theoretical task 2, Practical task 2, data

Additional materials: Visualization tutorial

Seminar 3. Decision trees

Theoretical task 3

Seminar 4. Linear classifiers

Theoretical task 4, Practical task 4, first dataset, diabetes dataset

UPD: In all parts of practical task 4 you should use the GD and SGD functions that you implement in the first part!

The deadline for this practical task has been changed for some groups! Check it in the table!

Seminar 5. Model evaluation

Theoretical task 5

Seminar 6. Bayesian decision rule

Theoretical task 6, Practical task 6, data

Practical task 6 has been extended: the last part is now described in more detail, and there are two small corrections in the first part (they are in bold font). Read it carefully!

The deadline for this practical task has been changed for all groups! Check it in the table!

Seminar 7. SVM and kernel trick

Theoretical task 7

Additional materials: Lecture by K. V. Vorontsov on SVM

Seminar 8. Regression

Theoretical task 8

Seminar 9. Boosting

Practical task 9, data

Seminar 10. Ensemble methods

Theoretical task 10

Problem 2: a small typo was corrected in the loss function formula.

Seminar 11. Summary

Seminar 12. How to solve practical problems

Dota Competition from the seminar, ipython notebook

Seminar 13. Feature selection

Theoretical task 13, Practical task 13

The practical task description is now complete.

Seminar 14. Feature extraction

You can read about computing PCA through SVD at the end of this paper.
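
As a rough illustration of the idea (not taken from the paper), PCA can be computed from the SVD of the centered data matrix. The sketch below uses NumPy; the function name pca_via_svd and the toy data are assumptions made for this example only.

```python
import numpy as np


def pca_via_svd(X, n_components):
    """Return (projected data, principal axes, explained variance)."""
    X = np.asarray(X, dtype=float)
    # PCA requires centered data: subtract the mean of every feature.
    X_centered = X - X.mean(axis=0)
    # Thin SVD: X_centered = U * diag(S) * Vt, singular values in descending order.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Rows of Vt are the principal axes (eigenvectors of the covariance matrix).
    components = Vt[:n_components]
    # Coordinates in the new basis: X_centered * V equals U * diag(S).
    projected = U[:, :n_components] * S[:n_components]
    # Eigenvalues of the sample covariance matrix are S^2 / (n - 1).
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return projected, components, explained_variance


# Toy data (hypothetical): 200 objects with 5 correlated features.
rng = np.random.RandomState(0)
X_demo = np.dot(rng.randn(200, 5), rng.randn(5, 5))
Z_demo, axes_demo, var_demo = pca_via_svd(X_demo, n_components=2)
```

Working with the SVD of the centered data avoids forming the covariance matrix explicitly, which is usually the numerically safer route.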

Seminar 15. Neural networks

Practical task 15, Data, Data in csv format, Censored training set, Theoretical task 15

Additional materials: Backpropagation, PyBrain’s documentation, PyBrain example from the seminar

New data files have been uploaded (there were some problems with reading the old ones). Therefore the deadline has been changed for some groups! Check it in the table!

If you get a MemoryError, read only part of the training data from the csv files (for example, 30000 objects). You can download the censored training set (see the link above) or use the following code:

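import numpy as np
from pybrain.datasets import ClassificationDataSet

# Python 2.7 code. Column 0 of each row is the digit label, the rest are pixel values.
mnist_train = np.loadtxt('mnist_train.csv', delimiter=',')
train_data = ClassificationDataSet(28*28, nb_classes=10)
for i in xrange(len(mnist_train)):
    # Scale pixels to [0, 1] and store the label as an integer class.
    train_data.appendLinked(mnist_train[i, 1:] / 255., int(mnist_train[i, 0]))
train_data._convertToOneOfMany()

# The test set is loaded in exactly the same way.
mnist_test = np.loadtxt('mnist_test.csv', delimiter=',')
test_data = ClassificationDataSet(28*28, nb_classes=10)
for i in xrange(len(mnist_test)):
    test_data.appendLinked(mnist_test[i, 1:] / 255., int(mnist_test[i, 0]))
test_data._convertToOneOfMany()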
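
One possible way to read only the first rows of a large csv file, as suggested by the MemoryError hint, is sketched below. The helper name load_first_rows is hypothetical and not part of the course materials.

```python
from itertools import islice

import numpy as np


def load_first_rows(path, n_rows, delimiter=','):
    """Parse only the first n_rows lines of a delimited text file."""
    with open(path) as f:
        # islice stops after n_rows lines, so the rest of the file is never read.
        lines = list(islice(f, n_rows))
    # np.loadtxt accepts a list of strings; ndmin=2 keeps a 2-D shape even for one row.
    return np.loadtxt(lines, delimiter=delimiter, ndmin=2)


# Possible usage (assumes mnist_train.csv is in the working directory):
# mnist_train = load_first_rows('mnist_train.csv', 30000)
```

The returned array can then be passed through the same appendLinked loop as the full data set.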

Seminar 16. Clustering

Theoretical task 16, Practical task 16, parrots.jpg, grass.jpg

Seminar 17. Clustering, EM-algorithm

Theoretical task 17

Seminar 18. Recommender systems

Theoretical task 18, Practical task 18, data

Additional materials: Factorization Machines

Columns in the data: 0 - user, 1 - item, 2 - rating, 3 - time (you don't need this one).

In the practical task you should train models on the train data (base) and evaluate on the test data.
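
A minimal sketch of this train/test protocol is given below. It rests on several assumptions that the task itself does not state: rows are whitespace-separated, RMSE is used as the quality metric, and the "model" is just a global-mean baseline. The sample rows are made up and stand in for the real files.

```python
import math


def parse_ratings(lines):
    """Parse rows 'user item rating time' into (user, item, rating) tuples."""
    ratings = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # Column 3 (time) is ignored, since it is not needed.
        ratings.append((int(fields[0]), int(fields[1]), float(fields[2])))
    return ratings


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / float(len(squared)))


# Hypothetical sample rows standing in for the train (base) and test files.
train_rows = ["1 10 4 881250949", "1 20 2 881250950",
              "2 10 5 881250951", "3 30 1 881250952"]
test_rows = ["2 20 3 881250953", "3 10 5 881250954"]

train = parse_ratings(train_rows)
test = parse_ratings(test_rows)

# Baseline: fit on the train part only (global mean), evaluate on the test part only.
global_mean = sum(r for _, _, r in train) / float(len(train))
test_rmse = rmse([global_mean] * len(test), [r for _, _, r in test])
```

The important point is the separation: nothing from the test rows is used when fitting, and the reported quality comes from the test rows alone.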

Evaluation criteria

The course runs during the 3rd and 4th modules. Students' knowledge is assessed through home assignments and exams. Home assignments are divided into theoretical and practical tasks. There are two exams: one after the 3rd module and one after the 4th module. Each exam evaluates theoretical knowledge and understanding of the material studied during the respective module.

The grade takes values 4, 5, …, 10. Grades 1, 2, and 3 are considered unsatisfactory. Exact grades are calculated using the following rule:

  • score ≥ 35% => 4,
  • score ≥ 45% => 5,
  • ...
  • score ≥ 95% => 10,

where score is calculated using the following rule:

score = 0.6 * S_homework + 0.2 * S_exam1 + 0.2 * S_exam2 + 0.2 * S_competition

  • S_homework – proportion of correctly solved homework,
  • S_exam1 – proportion of successfully answered theoretical questions during the exam after module 3,
  • S_exam2 – proportion of successfully answered theoretical questions during the exam after module 4,
  • S_competition – score for the machine learning competition (also from 0 to 1).

Participation in the machine learning competition is optional and can earn students extra points.
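
The grading rule above could be sketched in code as follows. The function names are hypothetical, and the thresholds hidden behind the "..." are assumed to continue in 10% steps (55% => 6, ..., 85% => 9); the source states only the first two and the last one.

```python
def total_score(homework, exam1, exam2, competition=0.0):
    """Weighted score; every argument is a proportion between 0 and 1."""
    # The competition is optional, so its term acts as a bonus on top of the other three.
    return 0.6 * homework + 0.2 * exam1 + 0.2 * exam2 + 0.2 * competition


def grade(score):
    """Map a score to a grade 4..10, or None if the result is unsatisfactory."""
    # Assumption: thresholds grow in 10% steps from 35% (grade 4) to 95% (grade 10).
    result = None
    for g in range(4, 11):
        threshold = (35 + 10 * (g - 4)) / 100.0
        if score >= threshold:
            result = g
    return result


# Example: 80% of homework solved, 70% and 60% on the exams, no competition.
example_score = total_score(0.8, 0.7, 0.6)
example_grade = grade(example_score)
```

Since the three mandatory weights already sum to 1, a student who also takes part in the competition can reach a score above 100%; the grade is still capped at 10.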

Plagiarism

If plagiarism is discovered, zero points will be given for the home assignments, for both works that were found to be identical. In case of repeated plagiarism by the same person, a report will be made to the dean.

Deadlines

All the deadlines can be found in the second tab here.

We have two deadlines for each assignment: normal and late. An assignment sent before the normal deadline is scored with no penalty. For assignments sent between the normal and the late deadline, the maximum score is reduced by 50%. Assignments sent after the late deadline receive a zero score unless there is a legitimate reason for the late submission; a high workload in other classes is not a legitimate reason.
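
The penalty rule can be illustrated with a small sketch. It interprets the 50% penalty as a multiplier applied to the score, which is an assumption; the function name and the dates are hypothetical, and legitimate reasons for late submission are left out on purpose.

```python
from datetime import datetime


def deadline_multiplier(submitted, normal_deadline, late_deadline):
    """Fraction of the score kept for a submission sent at time `submitted`."""
    if submitted <= normal_deadline:
        return 1.0  # sent before the normal deadline: no penalty
    if submitted <= late_deadline:
        return 0.5  # sent between the two deadlines: 50% penalty
    # Sent after the late deadline. Legitimate reasons are decided by the
    # teachers and are not modeled here.
    return 0.0


# Hypothetical deadlines for one assignment.
normal = datetime(2016, 2, 14, 23, 59)
late = datetime(2016, 2, 28, 23, 59)

on_time = deadline_multiplier(datetime(2016, 2, 10, 12, 0), normal, late)
delayed = deadline_multiplier(datetime(2016, 2, 20, 12, 0), normal, late)
too_late = deadline_multiplier(datetime(2016, 3, 1, 0, 0), normal, late)
```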

The standard period for working on a homework assignment is 2 and 4 weeks (normal and late deadlines, respectively) for practical assignments and 1 and 2 weeks for theoretical ones. The first practical assignment is an exception.

Deadline time: 23:59 on the day before the seminar (Sunday for students attending Monday seminars and Wednesday for students attending Thursday seminars).

Structure of emails and homework submissions

All questions and submissions must be sent to cshse.ml@gmail.com. The following subject lines must be used:

  • For questions (general, regarding assignments, etc): "Question - Surname Name - Group(subgroup)"
  • For homework submissions: "Practice/Theory {Lab number} - Surname Name - Group(subgroup)"

Example: Practice 1 - Ivanov Ivan - 131(1)

If you want to address a particular teacher, mention their name in the subject.

Example: Question - Ivanov Ivan - 131(1) - Ekaterina
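
For those who prefer to generate subject lines programmatically, the format above might be sketched as a small helper. The function email_subject and its parameters are hypothetical and only reproduce the two examples given here.

```python
def email_subject(kind, surname, name, group, subgroup, lab_number=None, teacher=None):
    """Build a subject line; kind is "Question", "Practice" or "Theory"."""
    if kind not in ("Question", "Practice", "Theory"):
        raise ValueError("unknown kind: %r" % (kind,))
    if kind != "Question" and lab_number is None:
        raise ValueError("homework submissions need a lab number")
    # Homework subjects carry the lab number right after the kind.
    head = kind if kind == "Question" else "{} {}".format(kind, lab_number)
    parts = [head, "{} {}".format(surname, name), "{}({})".format(group, subgroup)]
    # Optionally address a particular teacher by appending their name.
    if teacher is not None:
        parts.append(teacher)
    return " - ".join(parts)


practice_subject = email_subject("Practice", "Ivanov", "Ivan", "131", 1, lab_number=1)
question_subject = email_subject("Question", "Ivanov", "Ivan", "131", 1, teacher="Ekaterina")
```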

Please do not mix two different topics in a single email such as theoretical and practical assignments etc. When replying, please use the same thread (i.e. reply to the same email).

Practical assignments must be submitted in IPython notebook format, theoretical ones in PDF. Practical assignments must use Python 2.7. Use your surname as the filename for assignments (e.g. Ivanov.ipynb). Do not archive your assignments.

Assignments can be written in either Russian or English.

Assignments can be submitted only once!

Useful links

Machine learning

Python

Python installation and configuration