Intro to Data Mining and Machine Learning 2020–2021

Course: Introduction to Data Mining and Machine Learning (2020–2021)

Lecturer: Dmitry Ignatov

TA: Stefan Nikolić

All the materials are available via our t

Final mark formula: FM = 0.8 Homeworks + 0.2 Exam.
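
As a quick illustration of the formula, here is a minimal sketch. It assumes that Homeworks and Exam are single marks on the same scale; the page does not say how the three homework marks are aggregated or how the result is rounded, so the function name and the sample marks below are hypothetical.

```python
def final_mark(homeworks: float, exam: float) -> float:
    """Weighted final mark: FM = 0.8 * Homeworks + 0.2 * Exam."""
    return 0.8 * homeworks + 0.2 * exam


# Hypothetical marks, assumed to be on the same scale (for example 0-10).
print(final_mark(homeworks=9.0, exam=6.0))
```

With these weights, the exam can move the final mark by at most 20% of the scale.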


Homeworks

  • Homework 1: Graph Spectral Clustering
  • Homework 2: Classification or Frequent Itemset Mining or Clustering
  • Homework 3: Recommender Systems

Lecture on 12 January 2021

Intro slides. Course plan. Assessment criteria. ML&DM libraries. What to read and watch?

Practice: demonstration with Orange.

Lecture on 19 January 2021

Classification. One-rule. Naïve Bayes. kNN. Logistic Regression. Train-test split and cross-validation. Quality Metrics (TP, FP, TN, FN, Precision, Recall, F-measure, Accuracy).

Practice: demonstration with Orange.
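
The quality metrics named in this lecture all derive from the four confusion-matrix counts. The sketch below uses the standard definitions for a binary classifier; the function name and the counts are made up for illustration, and returning 0.0 for an undefined ratio is an assumption rather than a course convention.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, F-measure (F1) and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts for a binary classifier evaluated on 100 test objects.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```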

Lecture on 26 January 2021

Classification (continued). Quality metrics. ROC curves.

Practice: demonstration with Orange.

Lecture on 2 February 2021

Introduction to Clustering. Taxonomy of clustering methods. K-means. K-medoids. Fuzzy C-means. Types of distance metrics. Hierarchical clustering. DBScan.

Practice: DBScan Demo.

Lecture on 9 February 2021

  • Introduction to Clustering (continued). Density-based techniques. DBScan and Mean-shift.
  • Graph and spectral clustering. Min-cuts and normalized cuts. Laplacian matrix. Fiedler vector. Two-mode Spectral Clustering (Spectral Biclustering). Applications: Web Advertising, Community detection in Social Networks, Music Recommendations.
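
To make the Laplacian matrix and the Fiedler vector concrete, here is a dependency-free sketch that bipartitions a toy graph by the sign of the Fiedler vector. It assumes a connected, undirected, unweighted graph without self-loops. The power iteration on a shifted Laplacian is a simplification chosen to avoid external libraries (in practice an eigen-solver such as `numpy.linalg.eigh` would be used), and the graph and function names are hypothetical.

```python
def build_adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    return adj


def laplacian(adj):
    """Unnormalised graph Laplacian L = D - A."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def fiedler_vector(adj, iterations=500):
    """Approximate the eigenvector of the second-smallest eigenvalue of L.

    Runs power iteration on shift*I - L, where shift = 2 * (max degree) is an
    upper bound on the largest eigenvalue of L, and removes the constant
    eigenvector (eigenvalue 0) at every step.
    """
    n = len(adj)
    lap = laplacian(adj)
    shift = 2 * max(sum(row) for row in adj)
    x = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # project out the constant vector
        x = [shift * x[i] - sum(lap[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = sum(v * v for v in x) ** 0.5
        x = [v / norm for v in x]
    return x


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = build_adjacency(6, edges)
vec = fiedler_vector(adj)

# The sign pattern of the Fiedler vector gives an approximate minimum cut.
side_a = {i for i, v in enumerate(vec) if v < 0}
side_b = set(range(6)) - side_a
print(sorted(side_a), sorted(side_b))
```

For this graph the split separates the two triangles, cutting only the bridging edge.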

Practice on 16 February 2021

Clustering with scikit-learn (k-means, hierarchical clustering, DBScan, MeanShift, Spectral Clustering).

Lecture on 2 March 2021

Practice: Spectral clustering.

Lecture: Decision tree learning. ID3. Information Entropy. Information gain. Gini coefficient and index. Overfitting and pruning. Decision trees for numeric data. Oblivious decision trees. Regression trees.

Lecture on 9 March 2021

Frequent Itemsets. Association Rules. Algorithms: Apriori, FP-growth. Interestingness measures. Closed and maximal itemsets.
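
The following is a minimal sketch of level-wise (Apriori-style) frequent itemset search followed by association rule generation. Support is measured as a fraction of transactions. The transactions, thresholds and function names are hypothetical, and the code is meant for toy data rather than as an efficient implementation.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while level:
        current = {s: support(s) for s in level}
        current = {s: v for s, v in current.items() if v >= min_support}
        frequent.update(current)
        # Join frequent k-itemsets into (k+1)-candidates and keep only those
        # whose k-subsets are all frequent (the Apriori pruning step).
        joined = {a | b for a in current for b in current
                  if len(a | b) == len(a) + 1}
        level = {c for c in joined
                 if all(frozenset(s) in current
                        for s in combinations(c, len(c) - 1))}
    return frequent


def association_rules(frequent, min_confidence):
    """Rules A -> B with confidence = supp(A ∪ B) / supp(A)."""
    rules = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, k)):
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent,
                                  supp, confidence))
    return rules


# Hypothetical toy transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]
frequent = frequent_itemsets(transactions, min_support=0.4)
rules = association_rules(frequent, min_confidence=0.7)
for antecedent, consequent, supp, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent),
          f"support={supp:.2f} confidence={conf:.2f}")
```

The lookup `frequent[antecedent]` is safe because every subset of a frequent itemset is itself frequent (downward closure).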

Lecture + Practice on 16 March 2021

Frequent Itemset Mining (continued). Applications: 1) Taxonomies of Website Visitors and 2) Web advertising.

Exercises. Frequent Itemsets. FP-growth. Closed itemsets.

Practice. Orange, SPMF, Concept Explorer.

Practice on 6 April 2021

Practice. Scikit-learn tutorial on kNN, Decision Trees, Naïve Bayes, Logistic Regression, SVM, etc.

Lecture on 13 April 2021

Introduction to Recommender systems. Taxonomy of Recommender Systems (non-personalised, content-based, collaborative filtering, hybrid, etc.). Real Examples. User-based and item-based collaborative filtering. Bimodal cross-validation.

Lecture + Practice on 25 April 2021

Practice: User-based and item-based collaborative filtering with Python and MovieLens.

Case-study: Non-negative Matrix Factorisation, Boolean Matrix Factorisation vs. SVD in Collaborative Filtering.
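
To illustrate how Boolean Matrix Factorisation differs from ordinary factorisation, the sketch below computes a Boolean matrix product, where addition is replaced by logical OR, so overlapping factors do not add up beyond 1. The matrices and the function name are hypothetical toy data, not taken from the case study.

```python
def boolean_product(a, b):
    """Boolean matrix product: C[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    inner = len(b)
    return [[int(any(a[i][k] and b[k][j] for k in range(inner)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical 3 x 4 user-item matrix (1 = the user interacted with the item).
R = [
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
]

# A candidate factorisation with inner dimension 2:
# U assigns users to factors, V assigns factors to items.
U = [
    [1, 0],
    [0, 1],
    [1, 1],
]
V = [
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]

print(boolean_product(U, V) == R)
```

The third user belongs to both factors. Under ordinary arithmetic that row of the product would be [1, 2, 2, 1], while the Boolean product reproduces the 0/1 row exactly.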

Lecture: Advanced factorisation models: PureSVD, SVD++, timeSVD, ALS.

Lecture on 11 May 2021

  • Advanced factorisation models: Factorisation Machines (continued).
  • Supervised Ensemble Learning. Bias-Variance decomposition. Bagging. Random Forest. Boosting for classification (AdaBoost) and regression. Stacking and Blending. Recommendation of Classifiers.

Practice plus Lecture on 18 May 2021

Practice: Bagging, Pasting, Random Projections, and Patching. Random Forest and Extra Trees. Gradient Boosting. Voting.

Lecture on Gradient Boosting.

Exam

  • Date: 29.06.2021. Starting time: 11:00. Location: remote exam (see the channel announcements).
  • Format: One-to-one meeting on Zoom with the lecturer or the course TA. Typically, you will be given one quick theoretical question and one small finger exercise.

If your marks for the HWs are satisfactory, taking the exam is optional but recommended.

  • Questions.

"What is it and how does it work?" questions based on the studied topics.

  1. Taxonomy of DM and ML methods.
  2. Classification. One-rule and Decision Stumps. Decision Trees. ID3 algorithm.
  3. Classification. Naïve Bayes. Smoothing.
  4. Classification. kNN.
  5. Classification. Logistic regression.
  6. Classification quality metrics. ROC and AUC.
  7. Clustering. k-means and k-medoids. Fuzzy c-means.
  8. Clustering. Hierarchical clustering.
  9. Clustering. DBScan and Mean-Shift.
  10. Clustering quality metrics. Silhouette. Elbow method. Cophenetic distance. Calinski and Harabasz score.
  11. Spectral Clustering. Laplacian graph transformation and min-cuts.
  12. Decision Trees. ID3. Information gain and Gini index.
  13. Ensemble Learning. Bias and variance decomposition. Overfitting.
  14. Ensemble Learning. Bagging.
  15. Ensemble Learning. Boosting. AdaBoost.
  16. Ensemble Learning. Random Forest.
  17. Ensemble Learning. Gradient Boosting.
  18. Data Mining. Frequent Itemset Mining and Association Rules. Interestingness Measures. Closed and Maximal Itemsets.
  19. Data Mining. Frequent Itemset Mining and Association Rules. Apriori vs. FP-growth.
  20. Recommender Systems. Collaborative Filtering. Item-based and user-based techniques. Quality metrics and bimodal cross-validation.
  21. Recommender Systems. NMF, Boolean Matrix Factorisation and SVD for Collaborative Filtering.
  22. Recommender Systems. Advances in matrix factorisation: PureSVD, SVD++, timeSVD, ALS, Factorisation Machines.
  • Small tasks.

Examples of finger exercises with pen and paper.

  1. Given a small dataset 5 x 4, find its most informative attributes based on Information Gain and Gini Index.
  2. Given a toy set of transactions, find no less than three association rules with a given support and confidence.
  3. Given a tiny user-item table, find the top three recommendations for a given user by user-based and item-based approaches.
  4. Given a small matrix of user-item interactions, decompose it into a product of Boolean matrices, preferably with a small inner dimension.
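
A hand computation for the first exercise can be checked with a short script such as the one below. It ranks attributes by the reduction in entropy (Information Gain) and in Gini index after a split. The 5 x 4 dataset (three attributes plus a class column) and all names are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_reduction(rows, attribute, target, impurity):
    """Parent impurity minus the weighted impurity of the split on attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * impurity(g) for g in groups.values())
    return impurity([row[target] for row in rows]) - weighted


# Hypothetical 5 x 4 dataset: three attributes and the class column "play".
rows = [
    {"outlook": "sunny", "windy": "no", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "humidity": "high", "play": "no"},
    {"outlook": "rainy", "windy": "no", "humidity": "high", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "humidity": "normal", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "humidity": "normal", "play": "yes"},
]
attributes = ["outlook", "windy", "humidity"]

info_gain = {a: impurity_reduction(rows, a, "play", entropy) for a in attributes}
gini_gain = {a: impurity_reduction(rows, a, "play", gini) for a in attributes}
for a in attributes:
    print(f"{a}: information gain={info_gain[a]:.3f}, "
          f"Gini reduction={gini_gain[a]:.3f}")
```

In this toy table "outlook" separates the classes perfectly, so it has the largest reduction under both criteria.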