Quiz 2 |
Machine Learning |
Due Date: |
This quiz uses Kaggle data collected in August 2019 from the Apple API. The data covers 17,000+ strategy games such as Sudoku and Pokémon Go. A sample of 1,000 games was extracted for this quiz. The primary objective is to explore the extent to which the design and description of a mobile game play a role in its overall success, using classification ML techniques.
You are given two datasets. One contains basic information on the games, including user ratings, cost, description (written by the developer), release date, etc. The other contains the frequencies of the two thousand most common words (tokens) used in the descriptions of these games. The idea is to test, through ML classification, whether the quality of design and popularity in terms of user ratings extend to the description and use of language.
The token dataset has been stemmed and tokenized, and cleaned of stop words, non-English tokens, empty lines, spaces, and special characters. It contains only the frequency with which each token appears in each description. This representation is known as Bag of Words (BOW), which is the next topic to explore.
In the first dataset, create a new categorical response variable from the user ratings, with three categories: Poor, Good, and Great (a pandas sketch follows the list below).
Poor = User Ratings <= 3
Good = User Ratings 3.5 or 4
Great = User Ratings 4.5 or 5
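A minimal binning sketch in pandas, assuming the file and the ratings column are named as below (substitute the actual names from your data):

```python
import pandas as pd

# Hypothetical file and column names; use the actual ones from the quiz data.
games = pd.read_csv("appstore_games.csv")

def rating_category(rating):
    """Map a numeric user rating to Poor/Good/Great."""
    if rating <= 3:
        return "Poor"
    elif rating <= 4:   # 3.5 or 4
        return "Good"
    else:               # 4.5 or 5
        return "Great"

games["Rating_Category"] = games["Average User Rating"].apply(rating_category)
```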
In the second dataset, perform dimension reduction in three different forms (a scikit-learn sketch follows the list):
a) Regular PCA keeping at least 90% of the variance
b) Sparse PCA keeping at most 20 components (note that this step will take approximately 5 minutes to run)
c) t-SNE producing three dimensions, using a perplexity of your choice.
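A sketch of the three reductions with scikit-learn, assuming the token frequencies live in a DataFrame loaded as below (the file name is a placeholder):

```python
import pandas as pd
from sklearn.decomposition import PCA, SparsePCA
from sklearn.manifold import TSNE

tokens = pd.read_csv("game_tokens.csv", index_col=0)  # hypothetical file name

# a) Regular PCA: a float n_components keeps enough components
#    to explain at least that fraction of the variance.
pca = PCA(n_components=0.90, random_state=0)
X_pca = pca.fit_transform(tokens)

# b) Sparse PCA capped at 20 components (expect roughly 5 minutes).
spca = SparsePCA(n_components=20, random_state=0)
X_spca = spca.fit_transform(tokens)

# c) t-SNE down to three dimensions; perplexity is your choice.
tsne = TSNE(n_components=3, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(tokens)
```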
Now, merge the two datasets on their indices to get ready for classification.
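For instance, assuming the reduced arrays from the previous sketch and the `games` DataFrame share the same row order, the merge can be a plain index join:

```python
import pandas as pd

# Wrap each reduced array as a DataFrame indexed like the token data,
# prefix the columns so the techniques stay distinguishable, then join.
pca_df  = pd.DataFrame(X_pca,  index=tokens.index).add_prefix("pca_")
spca_df = pd.DataFrame(X_spca, index=tokens.index).add_prefix("spca_")
tsne_df = pd.DataFrame(X_tsne, index=tokens.index).add_prefix("tsne_")

merged = games.join([pca_df, spca_df, tsne_df])
```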
Use only the new reduced dimensions created by PCA, Sparse PCA, and t-SNE to predict the categorical user-rating column. Feel free to use these dimensions in pure form (dimensions from only one technique) or in combination (dimensions mixed from several techniques).
Use the KNN, Random Forest, Gradient Boosting, and XGBoost algorithms to compete for the best result. Use your choice of metric (F1 or accuracy) as the benchmark for measuring and comparing model performance.
Make sure to split your data into train and test sets with a distribution of 80% vs. 20%, respectively. Don't use k-fold validation for this quiz.
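A minimal split sketch, assuming `merged` from above and using only the reduced dimensions as features:

```python
from sklearn.model_selection import train_test_split

feature_cols = [c for c in merged.columns
                if c.startswith(("pca_", "spca_", "tsne_"))]
X = merged[feature_cols]
y = merged["Rating_Category"]

# Single 80/20 split, no k-fold validation.
# random_state here is a placeholder: use your student number.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=12345, stratify=y)
```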
The following are the minimum constraints for the hyperparameters of the ML algorithms (a sketch of the corresponding parameter grids follows the list):
KNN:
Constants: algorithm = KD-Tree only; distance metric = Euclidean (p = 2, the power in the Minkowski distance formula)
Variables: number of neighbors (k), number of features
Random Forest:
(n_estimators = 100, 500; min_samples_split = 5; min_samples_leaf = 5; random_state = your student number)
Gradient Boosting:
(learning_rate = 0.1, 0.01, 0.001; n_estimators = 100, 500; subsample = 0.6, 0.8, 1; min_samples_split = 5; min_samples_leaf = 5; random_state = your student number)
XGBoost:
(learning_rate = 0.1, 0.01, 0.001; n_estimators = 100, 500; subsample = 0.6, 0.8, 1; max_depth = 5, 7, 9)
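As a sketch of how these constraints could translate into code: a plain loop over enumerated parameter grids (no k-fold, per the instructions above), with macro F1 as the example metric and `12345` standing in for your student number. The KNN k values below are an assumption, since the quiz leaves them to you:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

STUDENT_NUMBER = 12345  # placeholder: use your own student number

# XGBoost expects integer class labels, so encode once for all models.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

grids = {
    "KNN": (KNeighborsClassifier, {
        "algorithm": ["kd_tree"],       # constant: KD-Tree only
        "p": [2],                       # constant: Euclidean distance
        "n_neighbors": [3, 5, 7, 11],   # variable: example k values
    }),
    "RandomForest": (RandomForestClassifier, {
        "n_estimators": [100, 500],
        "min_samples_split": [5],
        "min_samples_leaf": [5],
        "random_state": [STUDENT_NUMBER],
    }),
    "GradientBoosting": (GradientBoostingClassifier, {
        "learning_rate": [0.1, 0.01, 0.001],
        "n_estimators": [100, 500],
        "subsample": [0.6, 0.8, 1.0],
        "min_samples_split": [5],
        "min_samples_leaf": [5],
        "random_state": [STUDENT_NUMBER],
    }),
    "XGBoost": (XGBClassifier, {
        "learning_rate": [0.1, 0.01, 0.001],
        "n_estimators": [100, 500],
        "subsample": [0.6, 0.8, 1.0],
        "max_depth": [5, 7, 9],
    }),
}

results = {}
for name, (Model, grid) in grids.items():
    best = 0.0
    for params in ParameterGrid(grid):
        model = Model(**params).fit(X_train, y_train_enc)
        score = f1_score(y_test_enc, model.predict(X_test), average="macro")
        best = max(best, score)
    results[name] = best

print(results)  # best test-set macro F1 per algorithm
```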