Intro to Data Mining and Machine Learning 2020–2021
Contents
- 1 Course: Introduction to Data Mining and Machine Learning (2020–2021)
- 1.1 Homeworks
- 1.2 Lecture on 12 January 2021
- 1.3 Lecture on 19 January 2021
- 1.4 Lecture on 26 January 2021
- 1.5 Lecture on 2 February 2021
- 1.6 Lecture on 9 February 2021
- 1.7 Practice on 16 February 2021
- 1.8 Lecture on 2 March 2021
- 1.9 Lecture on 9 March 2021
- 1.10 Lecture + Practice on 16 March 2021
- 1.11 Practice on 6 April 2021
- 1.12 Lecture on 13 April 2021
- 1.13 Lecture + Practice on 25 April 2021
- 1.14 Lecture on 11 May 2021
- 1.15 Practice plus Lecture on 18 May 2021
- 1.16 Exam
Course: Introduction to Data Mining and Machine Learning (2020–2021)
Lecturer: Dmitry Ignatov
TA: Stefan Nikolić
All the materials are available via our t
Final mark formula: FM = 0.8 × Homeworks + 0.2 × Exam.
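As a quick sanity check of the formula, here is a minimal sketch; the grades below are hypothetical, assuming the usual 10-point scale.

```python
# Final mark per the course formula: FM = 0.8 * Homeworks + 0.2 * Exam.
# The scores are hypothetical examples, assumed to be on a 10-point scale.
homeworks = 8.0  # average homework grade
exam = 6.0       # exam grade
fm = 0.8 * homeworks + 0.2 * exam
print(round(fm, 1))  # 7.6
```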
Homeworks
- Homework 1: Graph Spectral Clustering
- Homework 2: Classification or Frequent Itemset Mining or Clustering
- Homework 3: Recommender Systems
Lecture on 12 January 2021
Intro slides. Course plan. Assessment criteria. ML&DM libraries. What to read and watch?
Practice: demonstration with Orange.
Lecture on 19 January 2021
Classification. One-rule. Naïve Bayes. kNN. Logistic Regression. Train-test split and cross-validation. Quality Metrics (TP, FP, TN, FN, Precision, Recall, F-measure, Accuracy).
Practice: demonstration with Orange.
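The lecture's pipeline can also be reproduced outside Orange. A minimal scikit-learn sketch, using the bundled iris data (not the course's own dataset): a train-test split, a Naïve Bayes classifier, and the basic quality metrics.

```python
# Train-test split + Gaussian Naive Bayes + quality metrics from the lecture.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Macro-averaging because iris has three classes, not two.
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="macro"))
print("recall   :", recall_score(y_test, y_pred, average="macro"))
print("F1       :", f1_score(y_test, y_pred, average="macro"))
```

Swapping `GaussianNB` for `KNeighborsClassifier` or `LogisticRegression` exercises the other classifiers from the lecture with the same few lines.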
Lecture on 26 January 2021
Classification (continued). Quality metrics. ROC curves.
Practice: demonstration with Orange.
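A short sketch of how a ROC curve and its AUC are computed in practice, using logistic regression on the bundled breast-cancer dataset (an illustrative choice, not the course's example).

```python
# ROC curve and AUC for a binary classifier: rank test points by the
# predicted probability of the positive class, then sweep the threshold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one point per threshold
print("AUC =", roc_auc_score(y_test, scores))
```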
Lecture on 2 February 2021
Introduction to Clustering. Taxonomy of clustering methods. K-means. K-medoids. Fuzzy C-means. Types of distance metrics. Hierarchical clustering. DBScan.
Practice: DBScan Demo.
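In the spirit of the demo, a tiny DBSCAN sketch on the classic two-moons data: a density-based method separates the moons where k-means cannot. The dataset and parameters are illustrative assumptions.

```python
# DBSCAN on two moon-shaped clusters; -1 in the output labels marks noise.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}))
```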
Lecture on 9 February 2021
- Introduction to Clustering (continued). Density-based techniques. DBScan and Mean-shift.
- Graph and spectral clustering. Min-cuts and normalized cuts. Laplacian matrix. Fiedler vector. Two-mode Spectral Clustering (Spectral Biclustering). Applications: Web Advertising, Community detection in Social Networks, Music Recommendations.
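The Laplacian/Fiedler-vector machinery above fits in a few lines of NumPy. A sketch of two-way spectral partitioning on a hypothetical 6-node graph (two triangles joined by one bridge edge): build L = D − A, take the eigenvector of the second-smallest eigenvalue, and split vertices by its sign.

```python
# Two-way spectral partitioning via the Fiedler vector of L = D - A.
import numpy as np

edges = [(0, 1), (0, 2), (1, 2),   # triangle 1
         (3, 4), (3, 5), (4, 5),   # triangle 2
         (2, 3)]                   # bridge edge (the min-cut)
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A        # unnormalised graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
fiedler = eigvecs[:, 1]               # vector of the second-smallest eigenvalue
labels = (fiedler > 0).astype(int)    # split vertices by sign
print(labels)                         # one triangle in each part
```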
Practice on 16 February 2021
Clustering with scikit-learn (k-means, hierarchical clustering, DBScan, MeanShift, Spectral Clustering).
Lecture on 2 March 2021
Practice: Spectral clustering.
Lecture: Decision tree learning. ID3. Information Entropy. Information gain. Gini coefficient and index. Overfitting and pruning. Decision trees for numeric data. Oblivious decision trees. Regression trees.
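A worked computation of information entropy and information gain, the quantities ID3 uses to pick a splitting attribute. The toy weather-style data below is a hypothetical example, not one from the lecture.

```python
# Entropy H(S) = -sum p_i log2 p_i; information gain of an attribute is
# H(S) minus the weighted entropy of the subsets the attribute induces.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(round(information_gain(outlook, play), 3))  # 0.667
```

Here H(S) = 1 bit (three "yes", three "no"), and splitting on `outlook` leaves only the mixed `rain` subset impure, so the gain is 1 − (2/6)·1 = 2/3.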
Lecture on 9 March 2021
Frequent Itemsets. Association Rules. Algorithms: Apriori, FP-growth. Interestingness measures. Closed and maximal itemsets.
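A minimal Apriori sketch over a hypothetical transaction database: generate candidate itemsets level by level, keeping only those that meet the support threshold (the anti-monotonicity of support justifies the pruning).

```python
# Level-wise Apriori: frequent k-itemsets are joined into (k+1)-candidates,
# and any candidate below min_support is pruned.
from itertools import combinations

transactions = [{"bread", "milk"},
                {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"},
                {"bread", "milk", "diapers"},
                {"bread", "milk", "beer"}]
min_support = 3  # absolute support threshold

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})
frequent, k = {}, 1
level = [frozenset([i]) for i in items]
while level:
    level = [s for s in level if support(s) >= min_support]  # prune
    for s in level:
        frequent[s] = support(s)
    k += 1
    level = list({a | b for a in level for b in level if len(a | b) == k})  # join

for s, sup in sorted(frequent.items(), key=lambda x: (-x[1], sorted(x[0]))):
    print(set(s), sup)
```

On this toy database all four singletons are frequent, and `{bread, milk}` is the only frequent 2-itemset.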
Lecture + Practice on 16 March 2021
Frequent Itemset Mining (continued). Applications: 1) Taxonomies of Website Visitors and 2) Web advertising.
Exercises. Frequent Itemsets. FP-growth. Closed itemsets.
Practice. Orange, SPMF, Concept Explorer.
Practice on 6 April 2021
Practice. Scikit-learn tutorial on kNN, Decision Trees, Naïve Bayes, Logistic Regression, SVM, etc.
Lecture on 13 April 2021
Introduction to Recommender Systems. Taxonomy of Recommender Systems (non-personalised, content-based, collaborative filtering, hybrid, etc.). Real examples. User-based and item-based collaborative filtering. Bimodal cross-validation.
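A sketch of user-based collaborative filtering on a hypothetical user-item rating matrix (0 = not rated): the score of an unseen item for a target user is the similarity-weighted average of the other users' ratings. Using raw cosine over full rating rows is a simplification; the lecture's variants may mean-center or use only co-rated items.

```python
# User-based CF: weight other users' ratings of the item by their
# cosine similarity to the target user.
import numpy as np

R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def predict(user, item):
    sims = np.array([cosine(R[user], R[v]) if v != user else 0.0
                     for v in range(R.shape[0])])
    rated = R[:, item] > 0          # only users who actually rated the item
    weights = sims * rated
    return weights @ R[:, item] / weights.sum()

print(round(predict(0, 2), 2))      # predicted rating of item 2 for user 0
```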
Lecture + Practice on 25 April 2021
Practice: User-based and item-based collaborative filtering with Python and MovieLens.
Case-study: Non-negative Matrix Factorisation, Boolean Matrix Factorisation vs. SVD in Collaborative Filtering.
Lecture: Advanced factorisation models: PureSVD, SVD++, timeSVD, ALS.
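The simplest of these models can be sketched in NumPy: a PureSVD-style predictor takes a rank-k truncated SVD of the rating matrix and treats the reconstruction as predicted scores. The matrix below is the same kind of hypothetical toy example as above.

```python
# PureSVD-style sketch: rank-k truncated SVD of the rating matrix,
# with the low-rank reconstruction used as predicted scores.
import numpy as np

R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_hat = U[:, :k] * s[:k] @ Vt[:k, :]   # rank-k reconstruction

print(np.round(R_hat, 2))
```

Unlike ALS or SVD++, this treats missing entries as zeros, which is exactly the simplification PureSVD is known for.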
Lecture on 11 May 2021
- Advanced factorisation models: Factorisation Machines (continued).
- Supervised Ensemble Learning. Bias-Variance decomposition. Bagging. Random Forest. Boosting for classification (AdaBoost) and regression. Stacking and Blending. Recommendation of Classifiers.
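The ensemble methods above are all available in scikit-learn; a minimal comparison against a single decision tree on the bundled iris data (an illustrative setup, not the lecture's benchmark):

```python
# Single tree vs. bagging, random forest, and AdaBoost, scored by 5-fold CV.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier

X, y = load_iris(return_X_y=True)
models = {
    "tree":     DecisionTreeClassifier(random_state=0),
    "bagging":  BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    "forest":   RandomForestClassifier(n_estimators=50, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```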
Practice plus Lecture on 18 May 2021
Practice: Bagging, Pasting, Random Projections, and Patching. Random Forest and Extra Trees. Gradient Boosting. Voting.
Lecture on Gradient Boosting.
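The core idea of the lecture can be sketched from scratch: for squared loss, the negative gradient is just the residual, so gradient boosting repeatedly fits a small tree to the current residuals. The synthetic regression data below is a hypothetical example.

```python
# Gradient boosting for regression with squared loss: each new depth-2 tree
# is fitted to the residuals (negative gradients) of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

lr, trees = 0.1, []
pred = np.zeros_like(y)
for _ in range(100):
    residuals = y - pred                          # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    pred += lr * tree.predict(X)                  # shrunken additive update

print("train MSE:", round(float(np.mean((y - pred) ** 2)), 4))
```

Library versions (`GradientBoostingRegressor`, XGBoost, etc.) add subsampling, regularisation, and other losses on top of exactly this loop.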
Exam
- Date: 29.06.2021. Starting time: 11:00. Location: remote exam (see the channel announcements).
- Format: One-to-one meeting in Zoom with the lecturer or the course TA. On average, you will be given one theoretical quick question and one small finger exercise.
- If your marks for the homeworks are satisfactory, taking the exam is optional but recommended.
- Questions.
"What is X and how does it work?"-style questions on the studied topics.
- Taxonomy of DM and ML methods.
- Classification. One-rule and Decision Stumps. Decision Trees. ID3 algorithm.
- Classification. Naïve Bayes. Smoothing.
- Classification. KNN
- Classification. Logistic regression.
- Classification quality metrics. ROC and AUC.
- Clustering. k-means and k-medoids. Fuzzy c-means.
- Clustering. Hierarchical clustering.
- Clustering. DBScan and Mean-Shift.
- Clustering quality metrics. Silhouette. Elbow method. Cophenetic distance. Calinski and Harabasz score.
- Spectral Clustering. Laplacian graph transformation and min-cuts.
- Decision Trees. ID3. Information gain and Gini index.
- Ensemble Learning. Bias and variance decomposition. Overfitting.
- Ensemble Learning. Bagging.
- Ensemble Learning. Boosting. AdaBoost.
- Ensemble Learning. Random Forest.
- Ensemble Learning. Gradient Boosting.
- Data Mining. Frequent Itemset Mining and Association Rules. Interestingness Measures. Closed and Maximal Itemsets.
- Data Mining. Frequent Itemset Mining and Association Rules. Apriori vs. FP-growth.
- Recommender Systems. Collaborative Filtering. Item-based and user-based techniques. Quality metrics and bimodal cross-validation.
- Recommender Systems. NMF, Boolean Matrix Factorisation and SVD for Collaborative Filtering.
- Recommender Systems. Advances in matrix factorisation: PureSVD, SVD++, timeSVD, ALS, Factorisation Machines.
- Small tasks.
Examples of finger exercises to solve with pen and paper.
- Given a small 5 × 4 dataset, find its most informative attributes based on Information Gain and the Gini Index.
- Given a toy set of transactions, find at least three association rules satisfying given support and confidence thresholds.
- Given a tiny user-item table, find the top three recommendations for a given user by the user-based and item-based approaches.
- Given a small matrix of user-item interactions, decompose it into a product of Boolean matrices with a preferably small inner dimension.
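As an illustration of the first exercise, a worked Gini computation on a hypothetical toy dataset (one attribute, binary class), showing the impurity before and after a candidate split:

```python
# Gini impurity 1 - sum p_i^2, and the weighted impurity after splitting
# on a candidate attribute; a good split drives the weighted Gini down.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = ["yes", "yes", "no", "no", "no"]
attr   = ["a",   "a",   "a",  "b",  "b"]   # candidate splitting attribute

total = gini(labels)
split = sum(
    (len(sub) / len(labels)) * gini(sub)
    for v in set(attr)
    for sub in [[l for a, l in zip(attr, labels) if a == v]]
)
print("Gini before:", round(total, 3), "after split:", round(split, 3))
```

Here the impurity drops from 1 − (2/5)² − (3/5)² = 0.48 to (3/5)·(4/9) + (2/5)·0 = 4/15 ≈ 0.267, so the split is informative.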