by: J.Roberge
How do descriptions affect a video game's performance? Do high-ranking video games have more detailed descriptions, and most importantly, can a description be used to predict a video game's performance? This analysis uses a previously curated dataset containing the tokenized descriptions of 10,000 video games found on iTunes. The main issues going forward are how to most effectively deal with an extremely sparse matrix and which type of model performs best on this kind of dataset. The analysis is broken down into three sections: section one deals with outlier detection; section two reduces the dimensions; and section three fits the models.
Please note that the way I fit these models is not 'correct' (you shouldn't touch your y_test until the very end); the method used was only employed due to the constraints of the assignment (which can be found here).
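For contrast, here is a minimal sketch of the conventional workflow on synthetic stand-in data (the variables below are placeholders, not objects from this notebook): tune hyperparameters by cross-validation on the training split only, and score the held-out test set exactly once.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
### synthetic stand-in data; in this notebook X would be the reduced token features and y the rating category
X, y = make_classification(n_samples=500, n_features=20, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
### hyperparameters are tuned by cross-validation on the training split only...
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [5, 10, 20, 50]}, cv=5)
search.fit(X_train, y_train)
### ...and the held-out test set is scored exactly once, at the very end
print("held-out accuracy:", search.score(X_test, y_test))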
Table of Contents
### importing dependencies ####
import pandas as pd
import numpy as np
### dimension reduction techniques
from sklearn.decomposition import PCA
from sklearn.decomposition import SparsePCA
from sklearn.manifold import TSNE
### Validation techniques/ search techniques
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ParameterGrid
## models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
### outlier
from sklearn.preprocessing import StandardScaler
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
### plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
### feature selection
from xgboost import plot_importance
from sklearn.feature_selection import SelectKBest, chi2, f_regression
### reading in Data Frame
%cd "C:\Users\jwr17\OneDrive - University of New Hampshire\machine learning\quiz_2\Quiz 2 ML"
### importing data sets ###
df_token=pd.read_csv("Apple1000games.csv", index_col=0)
df_des=pd.read_csv("apple1000new.csv")
### mapping to categorical
cuts=pd.cut(df_des['Average User Rating'], [-1, 2.99, 3.99, 5.5], labels=['Poor', 'Good', "Great"])
# adding cuts to the description csv
df_des['User_cat']=cuts
### checking for missing values
df_des.User_cat.isna().sum()
### I'm going to impute missing values to 'Poor' due to the size of the
### data set (losing 100 rows is significant, and I believe it is safe to assume that games without a rating are more than likely poor)
df_des['User_cat']=df_des.User_cat.fillna('Poor')
print("Shape fo description file", df_des.shape)
display(df_des.head(3))
print("\nShape fo token file", df_token.shape)
display(df_token.head())
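Before worrying about modeling, it is worth quantifying how sparse the token matrix actually is. A quick check, assuming a zero cell means the token does not appear in that game's description:
### sparsity check (assumes 0 == token absent from the description)
sparsity = (df_token == 0).values.mean()
print(f"{sparsity:.1%} of the cells in df_token are zero")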
I'm currently at a crossroads with outlier detection. This is a sparse matrix, and I worry that a cell with any sort of value in it may be considered an outlier. To make sure I'm not just getting rid of good data, I will sort through what is considered an outlier.
#### outlier Detection ####
# I am currently at a crossroads with outlier detection. I'm a little worried about the data being sparse and trying to run outlier
# analysis on it.
### standardization ####
standard=StandardScaler().fit_transform(df_token)
df_token=pd.DataFrame(standard, columns=df_token.columns)
### mahalanobis ###
clf = EllipticEnvelope(contamination=.05,random_state=0)
clf.fit(df_token)
Envelope_prediction= clf.predict(df_token)
envelope_scores = pd.Series(clf.decision_function(df_token))
### isolation forest ###
clf = IsolationForest( n_estimators=400, random_state=4, n_jobs=-1, contamination=.05)
clf.fit(df_token)
outliers=clf.predict(df_token)
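Before dropping anything, here is a quick sketch for reviewing what was flagged, using the predictions and scores computed above: it cross-tabulates the two detectors' flags and pulls up the games the elliptic envelope is most confident are outliers.
### reviewing the flags (-1 = outlier, 1 = inlier)
print(pd.crosstab(pd.Series(Envelope_prediction, name='mahalanobis'),
                  pd.Series(outliers, name='isolation')))
### the lowest envelope scores are the rows it is most confident are outliers
most_extreme = envelope_scores.nsmallest(5).index
display(df_des.iloc[most_extreme])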
After reviewing the outliers, it seems that both the Mahalanobis (elliptic envelope) and isolation forest approaches performed well. From here I will drop every row that both methods flag as an outlier.
#### dropping outliers ###
outliers_master=pd.concat([pd.Series(Envelope_prediction), pd.Series(outliers)], axis=1)
outliers_master.columns=['mahalanobis', 'isolation']
display(outliers_master[outliers_master.isolation < 1].head())  # rows the isolation forest flags as outliers
print("The shape of the outlier dataframe: ",outliers_master.shape)
display(outliers_master.head(3))
### Dropping outliers
df_token=df_token[(outliers_master.isolation==1) | (outliers_master.mahalanobis ==1)]
df_des=df_des[(outliers_master.isolation==1) | (outliers_master.mahalanobis ==1)]
print("The shape of the token dataframe after outlier drop:", df_token.shape)
display(df_token.head(3))
print("The shape of the description dataframe after outlier drop:", df_des.shape)
display(df_des.head(3))
Currently there are around 10,000 features in this dataset, and these features are extremely sparse. In order to get a working model, dimension reduction must take place. I will employ three dimension reduction techniques: first, I will use Principal Component Analysis and keep 90% of the variation; second, I will employ Sparse PCA and keep the top 20 components; and third, I will use t-SNE and keep three components.
## pca's
pca=PCA(0.9).fit_transform(df_token)
master = pd.DataFrame(pca, columns=['PC_'+str(i) for i in range(1,pca.shape[1]+1)])
## sparse pca
sparse=SparsePCA(n_components=20, n_jobs=-1).fit_transform(df_token)
sparse= pd.DataFrame(sparse, columns=['Sparse_'+str(i) for i in range(1,21)])
master=pd.concat([sparse,master], axis=1)
del(sparse)
del(pca)
print("Shape of the reduced dimension df with sparse and pca", master.shape)
display(master.head(3))
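To see how many components PCA(0.9) actually kept, and how much variance they capture, a quick check; this refits a separate PCA estimator, since the cell above only stored the transformed values.
### inspecting the PCA fit
pca_model = PCA(0.9).fit(df_token)
print("Components kept:", pca_model.n_components_)
print("Variance explained:", pca_model.explained_variance_ratio_.sum().round(3))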
After running through several iterations, and after doing some research, I have decided to change the distance metric for t-SNE from Euclidean to cosine. According to that research, cosine distance empirically performs better on sparse matrices. Additionally, the plot does seem to be slightly better. (The plot will print outside of the notebook.)
### t-SNE
X_embedding = TSNE(metric='cosine', n_components=3, perplexity=1000, n_iter=500, learning_rate=500).fit_transform(df_token)
### 3d Graph Prints outside notebook ###
get_ipython().run_line_magic('matplotlib', 'inline')
df_sne_3d=pd.DataFrame(X_embedding, columns=['tsne_1',"tsne_2","tsne_3"])
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
color=[]
for a in df_des['User_cat']:
    if a == 'Great':
        color.append('green')
    elif a == 'Good':
        color.append('yellow')
    elif a == 'Poor':
        color.append('red')
    else:
        color.append('gray')  # fallback; 'NaN' is not a valid matplotlib color
ax.scatter3D(xs=df_sne_3d.tsne_1, ys=df_sne_3d.tsne_2, zs=df_sne_3d.tsne_3,c=color, marker='o')
### appending t-sne to master
master=pd.concat([master, df_sne_3d],axis=1)
print("Shape of the reduced dimension df with tsne, sparse and pca", master.shape)
display(master.head(3))
### Cleaning up my memory
del(df_sne_3d, cuts, color,ax,fig,standard,outliers,outliers_master,Envelope_prediction,X_embedding,a)
The following graph shows the total count of each target value. I made this graph to see the balance of the classes. Overall, I would say the class balance is pretty good, and therefore I will not pursue any type of re-sampling technique.
### seeing how the classes balance out
get_ipython().run_line_magic('matplotlib', 'inline')
sns.countplot(x=df_des['User_cat'])
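To put numbers on that balance, the class proportions can be printed directly:
### class proportions of the target
print(df_des['User_cat'].value_counts(normalize=True).round(3))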
Due to the constraints of the assignment, I will be limited in my use of GridSearchCV(), because I can't score my y_test data on individual models when I run a grid search. To alleviate this issue I will use ParameterGrid(), which is the function that generates the grid behind GridSearchCV(). This parameter grid will be used to fit individual models throughout the analysis. The fitting process will be broken down into four parts. Part one will be the dictionary layout for ParameterGrid. Part two will be the functions I plan to use for model fitting. Part three will fit all possible models using four types of features: all features, t-SNE features, Sparse PCA features, and PCA features. The last part will fit all possible models based on feature importance.
#### dictionaries for paramgrid ###
knn_params={
'p':[2],
'n_neighbors':[5,10,20,50],
'n_jobs':[-1]
}
random_params={
'n_estimators':[100,500],
'min_samples_split':[5],
'min_samples_leaf':[5],
'n_jobs':[-1]
}
gradient_params={
'learning_rate':[.1,.01,.001],
'n_estimators':[100,500],
'subsample':[.6,.8,1],
'min_samples_split':[5],
'min_samples_leaf':[5],
'random_state':[4],
}
xg_params={
'learning_rate':[.1,.01,.001],
'n_estimators':[100,500],
'subsample':[.6,.8,1],
'max_depth':[5,7,9],
'n_jobs':[-1]
}
master_params=[knn_params, random_params, gradient_params, xg_params]
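As a quick illustration of what ParameterGrid() produces, expanding the KNN dictionary above yields one dict per parameter combination:
### ParameterGrid expands a dict of lists into every parameter combination
for params in ParameterGrid(knn_params):
    print(params)
### e.g. {'n_jobs': -1, 'n_neighbors': 5, 'p': 2}, {'n_jobs': -1, 'n_neighbors': 10, 'p': 2}, ...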
The following function takes a model and a parameter grid, fits every parameter combination on the training set, and then scores the predictions on the testing set.
from sklearn.model_selection import ParameterGrid
def get_model(model, x_train, y_train, x_test, y_test, param, scoring1, scoring2):
    """
    Fits every parameter combination in the grid and returns a dataframe
    of all models scored against the test data.
    """
    ### creates the parameter grid
    param_grid = ParameterGrid(param)
    ### puts the param grid into a dataframe
    df_param_grid = pd.DataFrame(param_grid)
    master_score1 = []
    master_score2 = []
    ### assigning the classifier name to the data frame
    df_param_grid['Classifier'] = model.__name__
    for params in param_grid:
        clf = model()
        ## setting the individual parameter combination on the estimator
        clf.set_params(**params)
        ### fitting the model
        clf.fit(x_train.values, y_train.astype(str).values)
        pred = clf.predict(x_test.values)
        ### calculating the score of y_test vs y_pred
        score = scoring1(y_test.astype(str).values, pred)
        print(score)
        master_score1.append(score)
        score = scoring2(y_test.astype(str).values, pred, average='macro')
        print(score)
        master_score2.append(score)
    ### assigning the scores to the param grid df
    df_param_grid[scoring1.__name__] = master_score1
    df_param_grid[scoring2.__name__] = master_score2
    return df_param_grid
The following for loop fits all possible models (see the master model list) across all possible parameters (see the parameter dictionaries), repeated for each of the different dimension reduction techniques.
master_params=[knn_params, random_params, gradient_params, xg_params]
master_models=[KNeighborsClassifier, RandomForestClassifier, GradientBoostingClassifier, XGBClassifier]
target=df_des['User_cat'].astype(str)
### column lists for fitting the different models
master_cols=master.columns
sparse_cols=[col for col in master.columns if 'Sparse' in col]
PC_cols=[col for col in master.columns if 'PC' in col]
tsne_cols=[col for col in master.columns if 'tsne' in col]
col_list=[master_cols, sparse_cols, PC_cols, tsne_cols]
col_names=['all', 'sparse', 'PCA', 'TSNE']
master_df=pd.DataFrame()
for i, col in enumerate(col_list):
    ## splits train and test
    x_train, x_test, y_train, y_test = train_test_split(master[col], target, test_size=.2)
    ### fits every model on this set of dimension-reduction features
    for j, model in enumerate(master_models):
        df = get_model(model, x_train, y_train, x_test, y_test, master_params[j], accuracy_score, f1_score)
        df['features'] = col_names[i]
        master_df = pd.concat([master_df, df], axis=0)
master_df.to_csv("master_quiz_3.csv")
### Observing The output ###
print("Shape of Paramgrid: ", master_df.shape)
display(master_df.head(3))
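The 'best fit model' referenced in the next section was chosen by F1 score; one quick way to surface it from the results frame built above:
### top models from the first round of fitting, ranked by macro F1
display(master_df.sort_values('f1_score', ascending=False).head())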
The following models are based on feature importance. Feature importance was determined through XGBoost's feature importance module, using the best-fitting model from the previous round of model fitting. The features for this run were selected using three criteria: information gain, weight, and total information gain. These features are broken down into four categories, and each category will be run through all possible models.
### looking through the results, I believe a combination of features may produce the best results; for this exercise I will
### pick what seem to be the best features.
## step one: using the best model thus far (by f1-score) and extracting the best features
x_train, x_test, y_train, y_test= train_test_split(master, target,test_size=.2)
### fitting the top model from previous model fittings
clf=XGBClassifier(learning_rate=.1, max_depth=7, n_estimators=100, n_jobs=-1, subsample=1.)
clf.fit(x_train,y_train)
### plotting best features
### To get gain or weight, change importance_type from 'total_gain'
plot_importance(clf,max_num_features=20, importance_type='total_gain')
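Rather than reading the feature names off the plot, the same scores can be pulled programmatically from the fitted booster; a small sketch using the clf fitted above:
### total-gain importance per feature, highest first (swap importance_type for 'gain' or 'weight')
scores = pd.Series(clf.get_booster().get_score(importance_type='total_gain'))
print(scores.sort_values(ascending=False).head(10))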
#### fitting models after selecting features based on gain, weight and total gain
total_gain=['PC_1','PC_12','PC_3','PC_10','PC_220', 'PC_16','PC_45','PC_54','PC_57', 'tsne_3']
weight=['tsne_1','tsne_2','tsne_3','PC_12', 'PC_3', 'PC_16', 'Sparse_1', 'Sparse_11','PC_9', 'PC_4']
gain=['PC_1','PC_57','PC_177','PC_408', 'PC_10', 'PC_328','PC_126','PC_195','PC_278','PC_307']
all_select=list(set(total_gain+weight+gain))
col_list=[total_gain, weight, gain, all_select]
col_names=['Total_Gain', 'Weight', 'Gain', 'All_select']
select_master=pd.DataFrame()
for i, col in enumerate(col_list):
    ## splits train and test
    x_train, x_test, y_train, y_test = train_test_split(master[col], target, test_size=.2)
    ### fits every model on this feature-importance subset
    for j, model in enumerate(master_models):
        df = get_model(model, x_train, y_train, x_test, y_test, master_params[j], accuracy_score, f1_score)
        df['features'] = col_names[i]
        select_master = pd.concat([select_master, df], axis=0)
select_master.sort_values('accuracy_score')
master_model=pd.concat([select_master,master_df])
master_model['Your Name']='Joshua Roberge'
master_model['Random State']=4
master_model_1=master_model.drop(['p','random_state', 'min_samples_split', 'n_jobs','min_samples_leaf'], axis=1)
master_model_1=master_model_1.rename(columns={'Classifier': 'Algorithm'})
master_model_1.to_csv('Master_model_1.csv')
At this point in time, I cannot conclusively say that a video game's description affects its performance, nor can I say that a video game's description has any sort of predictive capability for a game's performance.
Not all is lost! Perhaps a deeper dive into NLP could solve the issue. Going forward, the use of a lexicon could prove beneficial and produce better predictive features. Additionally, I believe experimenting with transformations of the token matrix could improve the models' overall performance.
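As one concrete version of that transformation idea, here is a minimal sketch applying a TF-IDF re-weighting to the token counts; note that df_token_raw is a hypothetical name for the un-standardized count matrix, not a variable defined in this notebook.
from sklearn.feature_extraction.text import TfidfTransformer
### hypothetical: df_token_raw is the raw token-count matrix, before StandardScaler
tfidf = TfidfTransformer()
df_token_tfidf = pd.DataFrame(tfidf.fit_transform(df_token_raw).toarray(),
                              columns=df_token_raw.columns)
### df_token_tfidf could then be fed through the same PCA / t-SNE pipeline as above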