Intro to Data Mining and Machine Learning 2020/2021
Lecturer: Dmitry Ignatov
TA: Stefan Nikolić
Final mark formula: FM = 0.8 × Homeworks + 0.2 × Exam.
Homeworks
- Homework 1: Spectral Clustering
- Homework 2:
- Homework 3: Recommender Systems
Lecture on 12 January 2021
Intro slides. Course plan. Assessment criteria. ML&DM libraries. What to read and watch?
Practice: demonstration with Orange.
Lecture on 19 January 2021
Classification. One-rule. Naïve Bayes. kNN. Logistic Regression. Train-test split and cross-validation. Quality Metrics (TP, FP, TN, FN, Precision, Recall, F-measure, Accuracy).
Practice: demonstration with Orange.
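For intuition, a minimal scikit-learn sketch of the train/test split, cross-validation, and the quality metrics above; the dataset and the choice of classifier are illustrative assumptions, not course materials.

```python
# Minimal sketch (illustrative dataset and model): split, metrics, cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, accuracy_score)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()  # TN, FP, FN, TP counts
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1:       ", f1_score(y_test, y_pred))
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```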
Lecture on 26 January 2021
Classification (continued). Quality metrics. ROC curves.
Practice: demonstration with Orange.
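A short companion sketch for ROC curves and AUC, under the same illustrative assumptions as the previous example:

```python
# Sketch: ROC curve points and AUC from predicted probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))
```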
Lecture on 2 February 2021
Introduction to Clustering. Taxonomy of clustering methods. K-means. K-medoids. Fuzzy C-means. Types of distance metrics. Hierarchical clustering. DBSCAN.
Practice: DBSCAN demo.
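A small sketch of k-means and DBSCAN on synthetic data; the blob dataset and the eps/min_samples values are arbitrary choices for illustration.

```python
# Sketch: k-means vs. DBSCAN on toy blobs (scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("k-means centroids:\n", km.cluster_centers_)

db = DBSCAN(eps=0.8, min_samples=5).fit(X)  # label -1 marks noise points
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("DBSCAN clusters found:", n_clusters)
```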
Lecture on 09 February 2021
- Introduction to Clustering (continued). Density-based techniques. DBSCAN and Mean-Shift.
- Graph and spectral clustering. Min-cuts and normalized cuts. Laplacian matrix. Fiedler vector. Applications.
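To make the Laplacian and the Fiedler vector concrete, a NumPy sketch on an assumed toy graph (two triangles joined by a single edge); the sign pattern of the Fiedler vector recovers the natural min-cut bipartition.

```python
# Sketch: unnormalised Laplacian L = D - A and its Fiedler vector.
import numpy as np

# Adjacency matrix of a toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))  # degree matrix
L = D - A                   # unnormalised graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]     # eigenvector of the second-smallest eigenvalue
print("Partition by sign:", (fiedler > 0).astype(int))  # splits the two triangles
```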
Practice on 16 Feb 2021
Clustering with scikit-learn (k-means, hierarchical clustering, DBSCAN, MeanShift, Spectral Clustering).
Lecture on 2 March 2021
Practice: Spectral clustering.
Lecture: Decision tree learning. ID3. Information Entropy. Information gain. Gini coefficient and index. Overfitting and pruning. Decision trees for numeric data. Oblivious decision trees. Regression trees.
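For the entropy and information-gain part, a pen-and-paper-sized sketch in Python; the toy data is assumed, not from the lecture.

```python
# Sketch: information entropy and information gain of a split, as in ID3.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, attr) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    gain, n = entropy(labels), len(labels)
    for v in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy weather-style data: 'windy' perfectly predicts 'play' here.
rows = [{"windy": "yes"}, {"windy": "yes"}, {"windy": "no"}, {"windy": "no"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "windy"))  # 1.0 bit: maximally informative split
```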
Lecture on 9 March 2021
Frequent Itemsets. Association Rules. Algorithms: Apriori, FP-growth. Interestingness measures. Closed and maximal itemsets.
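A simplified Apriori-style sketch over assumed toy transactions; the candidate generation is deliberately naive (real Apriori also prunes candidates with an infrequent subset), so treat this as a sketch of the level-wise idea only.

```python
# Sketch: level-wise frequent itemset mining on a toy transaction set.
from itertools import combinations

transactions = [{"bread", "milk"},
                {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"},
                {"bread", "milk", "diapers"},
                {"bread", "milk", "beer"}]
min_support = 3  # absolute support threshold

def frequent_itemsets(transactions, min_support):
    current = list({frozenset([i]) for t in transactions for i in t})
    frequent, k = [], 1
    while current:
        counted = [(s, sum(s <= t for t in transactions)) for s in current]
        level = [(s, c) for s, c in counted if c >= min_support]
        frequent += level
        survivors = [s for s, _ in level]
        # Join step: unions of frequent k-itemsets that yield (k+1)-itemsets.
        current = list({a | b for a, b in combinations(survivors, 2)
                        if len(a | b) == k + 1})
        k += 1
    return frequent

for itemset, support in frequent_itemsets(transactions, min_support):
    print(sorted(itemset), support)
```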
Lecture + Practice on 16 March 2021
Frequent Itemset Mining (continued). Applications: 1) Taxonomies of Website Visitors and 2) Web advertising.
Exercises. Frequent Itemsets. FP-growth. Closed itemsets.
Practice. Orange, SPMF, Concept Explorer.
Practice on 6 April 2021
Practice. Scikit-learn tutorial on kNN, Decision Trees, Naïve Bayes, Logistic Regression, SVM, etc.
Lecture on 13 April 2021
Introduction to Recommender Systems. Taxonomy of Recommender Systems (non-personalised, content-based, collaborative filtering, hybrid, etc.). Real examples. User-based and item-based collaborative filtering. Bimodal cross-validation.
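A minimal user-based collaborative filtering sketch on an assumed toy rating matrix (cosine similarity, similarity-weighted averaging; no mean-centring or neighbourhood truncation):

```python
# Sketch: user-based CF with cosine similarity on a toy user-item matrix.
import numpy as np

# Rows = users, columns = items; 0 means "not rated".
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

target = 0
sims = np.array([cosine(R[target], R[u]) for u in range(len(R))])
sims[target] = 0.0  # exclude the target user from their own neighbourhood

# Predict scores as a similarity-weighted average of the neighbours' ratings.
pred = sims @ R / (sims.sum() + 1e-12)
unrated = R[target] == 0
print("Predicted scores for unrated items:", pred[unrated])
```

The item-based variant is symmetric: compute similarities between columns and average over the items the target user has already rated.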
Lecture + Practice on 25 April 2021
Practice: User-based and item-based collaborative filtering with Python and MovieLens.
Case-study: Non-negative Matrix Factorisation, Boolean Matrix Factorisation vs. SVD in Collaborative Filtering.
Lecture: Advanced factorisation models: PureSVD, SVD++, timeSVD, ALS.
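To contrast the factorisation flavours, a sketch comparing truncated SVD with NMF on the same assumed toy matrix (rank 2 is an arbitrary choice):

```python
# Sketch: rank-2 SVD vs. NMF reconstructions of a toy rating matrix.
import numpy as np
from sklearn.decomposition import NMF

R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# Truncated SVD: keep the two largest singular values.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_svd = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

# NMF: R ~ W @ H with non-negative factors, often easier to interpret.
nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(R)
H = nmf.components_

print("SVD reconstruction error:", np.linalg.norm(R - R_svd))
print("NMF reconstruction error:", np.linalg.norm(R - W @ H))
```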
Lecture on 11 May 2021
- Advanced factorisation models: Factorisation Machines (continued).
- Supervised Ensemble Learning. Bias-Variance decomposition. Bagging. Random Forest. Boosting for classification (AdaBoost) and regression. Stacking and Blending. Recommendation of Classifiers.
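A small scikit-learn sketch contrasting bagging and boosting built over the same weak learner; the dataset and hyperparameters are illustrative assumptions.

```python
# Sketch: bagging vs. AdaBoost over decision stumps.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)  # high-bias, low-variance weak learner

bag = BaggingClassifier(stump, n_estimators=100, random_state=0)   # variance reduction
ada = AdaBoostClassifier(stump, n_estimators=100, random_state=0)  # bias reduction

print("Bagging CV accuracy: ", cross_val_score(bag, X, y, cv=5).mean())
print("AdaBoost CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```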
Practice plus Lecture on 18 May 2021
Practice: Bagging, Pasting, Random Projections, and Patching. Random Forest and Extra Trees. Gradient Boosting. Voting.
Lecture on Gradient Boosting.
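The core idea of gradient boosting for squared loss, sketched by hand: each new tree is fit to the residuals (the negative gradient) of the current ensemble. Data and hyperparameters are assumptions for illustration.

```python
# Sketch: gradient boosting for regression, written out step by step.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

F = np.zeros_like(y)  # current ensemble prediction, initialised at zero
learning_rate, trees = 0.1, []
for _ in range(100):
    residuals = y - F                        # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    F += learning_rate * tree.predict(X)     # shrunken additive update

print("Training MSE:", np.mean((y - F) ** 2))
```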
Exam
- Date: 29.06.2021. Starting time: 11:00. Location: remote exam.
- Questions.
"What is it and how does it work?" questions based on the studied topics.
- Taxonomy of DM and ML methods.
- Classification. One-rule and Decision Stumps. Decision Trees. ID3 algorithm.
- Classification. Naïve Bayes. Smoothing.
- Classification. kNN.
- Classification. Logistic regression.
- Classification quality metrics. ROC and AUC.
- Clustering. k-means and k-medoids. Fuzzy c-means.
- Clustering. Hierarchical clustering.
- Clustering. DBSCAN and Mean-Shift.
- Clustering quality metrics. Silhouette. Elbow method. Cophenetic distance. Calinski and Harabasz score.
- Spectral Clustering. Laplacian graph transformation and min-cuts.
- Decision Trees. ID3. Information gain and Gini index.
- Ensemble Learning. Bias and variance decomposition. Overfitting.
- Ensemble Learning. Bagging.
- Ensemble Learning. Boosting. AdaBoost.
- Ensemble Learning. Random Forest.
- Ensemble Learning. Gradient Boosting.
- Data Mining. Frequent Itemset Mining and Association Rules. Interestingness Measures. Closed and Maximal Itemsets.
- Data Mining. Frequent Itemset Mining and Association Rules. Apriori vs. FP-growth.
- Recommender Systems. Collaborative Filtering. Item-based and user-based techniques. Quality metrics and bimodal cross-validation.
- Recommender Systems. NMF, Boolean Matrix Factorisation and SVD for Collaborative Filtering.
- Recommender Systems. Advances in matrix factorisation: PureSVD, SVD++, timeSVD, ALS, Factorisation Machines.
- Small tasks.
Examples of pen-and-paper exercises:
- Given a small 5 × 4 dataset, find its most informative attributes based on Information Gain and the Gini Index.
- Given a toy set of transactions, find at least three association rules with a given support and confidence.
- Given a tiny user-item table, find the top three recommendations for a given user using the user-based and item-based approaches.
- Given a small matrix of user-item interactions, factorise it into a Boolean product of two matrices with a preferably smaller inner dimension.
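For the last exercise, a sketch of what is being checked: a Boolean matrix product takes OR over AND, and the goal is an exact (or near-exact) factorisation with a small inner dimension. The matrices below are hand-picked assumptions.

```python
# Sketch: verifying a Boolean factorisation R = A o B over the Boolean semiring.
import numpy as np

R = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]], dtype=bool)

A = np.array([[1, 0],
              [1, 0],
              [0, 1]], dtype=bool)     # users x factors
B = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=bool)  # factors x items

# (A o B)[i, j] = OR over k of (A[i, k] AND B[k, j]).
boolean_product = (A[:, :, None] & B[None, :, :]).any(axis=1)
print(np.array_equal(boolean_product, R))  # True: exact rank-2 Boolean factorisation
```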