Objective: This Analysis will attempt to understand what it takes to have a successful TV show. To answer the purposed question, this analysis will use three main techniques: one, web scrapping to acquire the data; two, descriptive statistics to understand what factors influence the popularity of movies; and three, Sentiment Analysis to understand peoples emotional connections to the shows they watch.
Web Sites: PermissionsThis is a ranking of the best tv episodes of all time. The episodes from the website range from 1960 to the years 2019. The reviews from this website will be used to understand the makeup of a good TV show
#### Html getting information ###
best_100<-read_html('http://besttv.theringer.com/')
x=TRUE
rankings<-best_100 %>%
html_nodes('div.meta-bar')%>%
html_node('span.mobile')%>%
html_text()
show<-best_100 %>%
html_nodes('div.meta-bar')%>%
html_node('h4')%>%
html_text()
episode<-best_100 %>%
html_nodes('div.meta-bar')%>%
html_node('h4.ep-number')%>%
html_text()
ep_name<-best_100 %>%
html_nodes('div.show-content-wrapper')%>%
html_node('h2')%>%
html_text()
review<-best_100 %>%
html_nodes('div.blurb')%>%
html_text()
director<-best_100 %>%
html_nodes('div.director')%>%
html_node('span.value')%>%
html_text()
WhereToWatch<-best_100%>%
html_nodes('div.rank')%>%
html_node('div.where-to-watch ul li')%>%
html_text()
writer<-best_100%>%
html_nodes('div.rank')%>%
html_node('div.writer span.value')%>%
html_text()
ranks<-best_100%>%
html_nodes('div.rank')%>%
html_text()
director<-str_replace(str_extract(ranks, "Directed By [:alpha:]+ [:alpha:]+"),"Directed By", "")
Exc_produxer<-str_replace(str_extract(director, "Executive Producer ([:alpha:] ,)+ [:alpha:]"),"Directed By", "")
excutive_producer<-str_extract(ranks, "Executive Producer ([:alpha:]+[:blank:]{1}[:alpha:]+[:blank:]{1})+D")
excutive_producer<-str_replace(str_replace(excutive_producer, " D$", ""), "Executive Producer", "")
director
[1] " Michelle MacLaren" " Toby Haynes" " George Plamondon"
[4] " Philip Martin" " Craig Zisk" " David Solomon"
[7] " Adam Sandler" " Stephen Hopkins" " Hettie Macdonald"
[10] " Bruce Gowers" " Barry Sonnenfeld" NA
[13] " Andrew Barchilon" " Bret Haaland" " Andy Devonshire"
[16] NA " Bryan Gordon" " Brad Falchuk"
[19] " Melina Matsoukas" NA " Judd Apatow"
[22] " Patrick Lope" " Walt Dohrn" " Ryan Murphy"
[25] " Jody Hill" NA " Beth McCarthy"
[28] " Jeff Melman" " John Wells" " Greg Yaitanes"
[31] NA " Bryan Gordon" " David Lynch"
[34] " Alan Ball" " Louis C" " Brooke Kennedy"
[37] " Anthony Russo" " Michael Cuesta" " Kevin McCarthy"
[40] " Dee Rees" " Tony Sacco" " Adam Arkin"
[43] " Ken Fuchs" " Sam Esmail" " Charles McDougall" [46] NA " JC Gonzalez" " Jack Bender"
[49] " Pamela Fryman" NA " James Whitmore"
[52] " Michael Schur" " Lena Dunham" " Jeff Tremaine"
[55] " Daniel Sackheim" " Craig Zobel" " Juan Jose"
[58] " Mike Judge" NA " Michael Schur"
[61] " Steve Shill" " Chuck O" " Ted Kotcheff"
[64] " Davis Guggenheim" " Ben Bolt" " Trey Parker"
[67] " Jason Sands" " Don Roy" " Emma Napper"
[70] " Steven Zaillian" NA " Matt Shakman"
[73] " Michael Rymer" NA " Nathan Fielder"
[76] " Joss Whedon" " Gail Mancuso" NA
[79] " David Mandel" " Kenny Ortega" " Owen Harris"
[82] " Michael Dinner" " Peter Berg" NA
[85] " Paul Feig" " Donald Glover" " Brad Kreisberg"
[88] " Andrew Jarecki" " Thomas Schlamme" " Mark Tinker"
[91] " Rian Johnson" " Doug Liman" " Steve Buscemi"
[94] NA " Joe Chappelle" " Neal Brennan"
[97] " Cary Fukunaga" " David Nutter" " Jennifer Getzinger" [100] " Jack Bender"
### combining Everything ####
Best_eps<-data.frame(WhereToWatch,excutive_producer,ep_name,ranks,show,episode,rankings,director,review)
### saving dirty Version ####
#
# write.csv(Best_eps,'best_eps_dirty.csv')
Best_eps<-read.csv('best_eps_dirty.csv')
WhereToWatch | excutive_producer | ep_name | ranks | show | episode | rankings | director |
---|---|---|---|---|---|---|---|
Netflix | NA | “Pretty Much Dead Already” | 100Character Death Directed By Michelle MacLaren Written By Scott Gimple Watch OnNetflix |
|
S2 E7 | 100 | Michelle MacLaren |
Netflix | NA | “The Reichenbach Fall” | 99 Directed By Toby Haynes Written By Steve Thompson Watch OnNetflix |
|
S2 E3 | 99 | Toby Haynes |
Amazon | Julie Auerbach | “What Happens in Cabo” | 98 Executive Producer Julie Auerbach Directed By George Plamondon Watch OnAmazon |
|
S1 E5 | 98 | George Plamondon |
Netflix | NA | “Smoke and Mirrors” | 97 Directed By Philip Martin Written By Peter Morgan Watch OnNetflix |
|
S1 E5 | 97 | Philip Martin |
Netflix | NA | “Go” | 96Season Finale Directed By Craig Zisk Written By Jenji Kohan, Victoria Morrow, Ron Fitzgerald Watch OnNetflix |
|
S3 E15 | 96 | Craig Zisk |
Best_eps$WhereToWatch<-str_squish(Best_eps[['WhereToWatch']])
Best_eps<-Best_eps%>%select(-c(ranks))
Best_eps['season']<-as.numeric(str_remove_all(str_extract_all(Best_eps$episode, '^S\\d{1,2}'),'S'))
Best_eps$episode<-as.numeric(str_remove_all(str_extract_all(Best_eps$episode,'E\\d{1,3}'),'E'))
Best_eps$show<-str_remove_all(Best_eps$show,'\\d{1,3}\\. ')
Best_eps$rankings<-as.numeric(str_remove_all(Best_eps$rankings,'\\.'))
write.csv(Best_eps,'best_eps_cleanish.csv')
kable(head(Best_eps[,-which(names(Best_eps)=="review")]), caption = 'Clean Episode Data')%>%kable_styling(bootstrap_options = c("striped", "hover"))
X | WhereToWatch | excutive_producer | ep_name | show | episode | rankings | director | season |
---|---|---|---|---|---|---|---|---|
1 | Netflix | NA | “Pretty Much Dead Already” | The Walking Dead | 7 | 100 | Michelle MacLaren | 2 |
2 | Netflix | NA | “The Reichenbach Fall” | Sherlock | 3 | 99 | Toby Haynes | 2 |
3 | Amazon | Julie Auerbach | “What Happens in Cabo” | Laguna Beach | 5 | 98 | George Plamondon | 1 |
4 | Netflix | NA | “Smoke and Mirrors” | The Crown | 5 | 97 | Philip Martin | 1 |
5 | Netflix | NA | “Go” | Weeds | 15 | 96 | Craig Zisk | 3 |
6 | Hulu | NA | “Out of Gas” | Firefly | 5 | 95 | David Solomon | 1 |
The website ratinggraph collects information about Tv shows, actors, directors and writers. This data will be used to acquire data for the use of descriptive statistics. These statistics will hopefully be able to answer: ‘Do nominations effect how popular a tv show is?’, ‘Are older TV shows more popular then newer ones?’ and ‘How does genres effect the ratting of a movie?’
### initiating driver run only once!!!!!!!!!
driver <- rsDriver(browser = c("firefox"))
## checking Selenium Server versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking chromedriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking geckodriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking phantomjs versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
[1] “Connecting to remote server” $acceptInsecureCerts [1] FALSE
$browserName [1] “firefox”
$browserVersion [1] “69.0.3”
$moz:accessibilityChecks
[1] FALSE
$moz:buildID
[1] “20191009172106”
$moz:geckodriverVersion
[1] “0.26.0”
$moz:headless
[1] FALSE
$moz:processID
[1] 9644
$moz:profile
[1] “C:\Users\jwr17\AppData\Local\Temp\rust_mozprofileq1Ep03”
$moz:shutdownTimeout
[1] 60000
$moz:useNonSpecCompliantPointerOrigin
[1] FALSE
$moz:webdriverClick
[1] TRUE
$pageLoadStrategy [1] “normal”
$platformName [1] “windows”
$platformVersion [1] “10.0”
$rotatable [1] FALSE
$setWindowRect [1] TRUE
$strictFileInteractability [1] FALSE
$timeouts \(timeouts\)implicit [1] 0
\(timeouts\)pageLoad [1] 300000
\(timeouts\)script [1] 30000
$unhandledPromptBehavior [1] “dismiss and notify”
$webdriver.remote.sessionid [1] “cecd3409-f762-4c6d-9924-e530a2ad6280”
$id [1] “cecd3409-f762-4c6d-9924-e530a2ad6280”
# Connect to the firefox driver and open it
remote_driver <- driver[["client"]]
remote_driver$open()
[1] “Connecting to remote server” $acceptInsecureCerts [1] FALSE
$browserName [1] “firefox”
$browserVersion [1] “69.0.3”
$moz:accessibilityChecks
[1] FALSE
$moz:buildID
[1] “20191009172106”
$moz:geckodriverVersion
[1] “0.26.0”
$moz:headless
[1] FALSE
$moz:processID
[1] 6348
$moz:profile
[1] “C:\Users\jwr17\AppData\Local\Temp\rust_mozprofileaTsoSA”
$moz:shutdownTimeout
[1] 60000
$moz:useNonSpecCompliantPointerOrigin
[1] FALSE
$moz:webdriverClick
[1] TRUE
$pageLoadStrategy [1] “normal”
$platformName [1] “windows”
$platformVersion [1] “10.0”
$rotatable [1] FALSE
$setWindowRect [1] TRUE
$strictFileInteractability [1] FALSE
$timeouts \(timeouts\)implicit [1] 0
\(timeouts\)pageLoad [1] 300000
\(timeouts\)script [1] 30000
$unhandledPromptBehavior [1] “dismiss and notify”
$webdriver.remote.sessionid [1] “f3f9d1ab-d73e-475c-9d21-2b2209e708af”
$id [1] “f3f9d1ab-d73e-475c-9d21-2b2209e708af”
# # # Point the remote driver to the url of interest
remote_driver$navigate("https://www.ratingraph.com/top_tv_shows/")
# #
# # ######## getting the list to 250 instead of 25 #######
Sys.sleep(5)
option <- remote_driver$findElement(using = 'xpath', '//*[@id="toplist_length"]/label/select/option[5]')
option$clickElement()
Sys.sleep(5)
# # ### creating a list of string number ie '1','2','3'
tabs<-sprintf("%d",seq(1:60))
# #
# # ### empty data frame for merging ###
master <- data.frame()
for (tab_num in tabs){
# #
# # ##### clicking through the tags #####
option <- remote_driver$findElement(using = 'link text', tab_num)
option$clickElement()
page <-remote_driver$getPageSource()
page1 <-read_html(page[[1]])
table<-page1%>%
html_node('table.display')%>%
html_table()
master<-rbind(master,table)
Sys.sleep(sample(1:5, 1, replace=T))
}
driver$client$close()
# master
# write.csv(master,'master.csv')
master<-read.csv('master.csv')
kable(head(master), caption = 'Dirty Data: Best Movies of all Time')%>%kable_styling(bootstrap_options = c("striped", "hover"))
X | Rank | Trend | Title | Start.year | End.year | Genres | Language | Runtime..Min.. | Seasons | Episodes | Wins | Nominations | Votes..SUM. | Rating..AVG. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 31 | Game of Thrones | 2011 | 2019 | Action, Adventure, Drama | English | 57 | 8 | 73 | 313 | 505 | 3,413,751 | 8.8 |
2 | 2 | 86 | Breaking Bad | 2008 | 2013 | Crime, Drama, Thriller | English | 49 | 5 | 62 | 138 | 217 | 1,095,244 | 9.0 |
3 | 3 | 86 | The Walking Dead | 2010 |
|
Drama, Horror, Thriller | English | 44 | 10 | 134 | 69 | 195 | 1,202,933 | 8.1 |
4 | 4 | 100 | Supernatural | 2005 |
|
Drama, Fantasy, Horror | English | 44 | 15 | 309 | 22 | 85 | 861,000 | 8.5 |
5 | 5 | 77 | Friends | 1994 | 2004 | Comedy, Romance | English | 22 | 10 | 236 | 71 | 211 | 731,459 | 8.5 |
6 | 6 | 38 | Black Mirror | 2011 |
|
Drama, Sci-fi, Thriller | English | 60 | 5 | 22 | 23 | 75 | 785,611 | 7.9 |
## cleaning up master ####
## changing to numeric class
master<-read.csv('master.csv')
master$Rank<-as.numeric(str_replace_all(master$Rank,',',''))
#### trimming white spaces ####
master$Title<-str_trim(master$Title, side='both')
#### cleaning end columns by specific criteria (some 'na are only there bc the show is still in ')
end_year_piece_one<-as.numeric(str_replace_all(master[1:1500,c('End.year')], '-', '2019'))
i=1501
for (endyear in master[c(1501:14780), c('End.year')]){
if (endyear != '-'){
end_year_piece_one<-c(end_year_piece_one, endyear)
}else if (master$Start.year[i]>= 2016){
end_year_piece_one<-c(end_year_piece_one, 2019)
}else{
end_year_piece_one<-c(end_year_piece_one, NA)
}
i<-i+1
}
master$End.year<-as.numeric(end_year_piece_one)
##### Numeric fitting Runtime..####
master$Runtime..Min..<-as.numeric(str_remove_all(master$Runtime..Min.., '-'))
#### master wins #####
master$Wins<-as.numeric(str_replace_all(master$Wins, '-','0'))
master$Nominations<-as.numeric(str_replace_all(master$Nominations, '-','0'))
master$Trend<-as.numeric(master$Trend)
master$Episodes<-as.numeric(master$Episodes)
master$Votes..SUM.<-as.numeric(str_remove_all(str_replace_all(master$Votes..SUM., '-','0'), ','))
write.csv(master,"clean_best_shows_R.csv")
master<-read.csv('clean_best_shows_R.csv')
master<-select(master,-c(X.1,X))
kable(head(master), caption = 'Clean Data: Best Movies of all Time')%>%kable_styling(bootstrap_options = c("striped", "hover"))
Rank | Trend | Title | Start.year | End.year | Genres | Language | Runtime..Min.. | Seasons | Episodes | Wins | Nominations | Votes..SUM. | Rating..AVG. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 24 | Game of Thrones | 2011 | 2019 | Action, Adventure, Drama | English | 57 | 8 | 328 | 313 | 505 | 3413751 | 8.8 |
2 | 45 | Breaking Bad | 2008 | 2013 | Crime, Drama, Thriller | English | 49 | 5 | 308 | 138 | 217 | 1095244 | 9.0 |
3 | 45 | The Walking Dead | 2010 | 2019 | Drama, Horror, Thriller | English | 44 | 10 | 47 | 69 | 195 | 1202933 | 8.1 |
4 | 4 | Supernatural | 2005 | 2019 | Drama, Fantasy, Horror | English | 44 | 15 | 210 | 22 | 85 | 861000 | 8.5 |
5 | 43 | Friends | 1994 | 2004 | Comedy, Romance | English | 22 | 10 | 155 | 71 | 211 | 731459 | 8.5 |
6 | 30 | Black Mirror | 2011 | 2019 | Drama, Sci-fi, Thriller | English | 60 | 5 | 138 | 23 | 75 | 785611 | 7.9 |
This is a ranking of the best tv episodes of all time. The best episodes data set ranges from 1950 to 2019. The reviews from this website will be used to understand the makeup of a good TV show.
library(ggplot2)
### making a coloumn which calculates how long a tv show has been arround
master$longevity<-master$End.year-master$Start.year
#### Getting Ride of Noisy and less popular movies ####
clean<-master[(master$Rating..AVG.>=0.0 & master$Votes..SUM.>2000 & master$Rank<7000),]
### plotting
plot<-ggplot(clean,aes(y=Nominations, size=-Rank, col=Start.year, x=Rating..AVG.))+geom_point(position='jitter')
plot+labs(size = "Ranks", caption='1')+ggtitle("Nominations Vs. Rating Average")+xlab('Rating Average')
Take aways:master['Primary_genere']=str_extract(master$Genres,"[:alpha:]+")
#### breaking things down by Genere #####
by_genere<-master%>%
select(Primary_genere,Rank, Rating..AVG., Votes..SUM., Nominations)%>%
group_by(Primary_genere)%>%
summarise(Nominations=sum(Nominations),
Rank=mean(Rank),
Rating..AVG.=mean(Rating..AVG.),
Votes..SUM.=mean(Votes..SUM.),
Nominations=mean(Nominations))
### plot one: Average Ranks by Genere ####
rank_plt<-ggplot(by_genere, aes(y=Rank, x=reorder(Primary_genere,Rank),fill=Primary_genere))+geom_bar(stat = "identity")+coord_flip()+ theme(legend.position="none")
rank_plt<-rank_plt+xlab('')+ylab('')+ggtitle("Average Rank by Genere")
### plot Two: Total Nominations by Genere###
rank_plt_2<-ggplot(by_genere, aes(y=Nominations, x=reorder(Primary_genere,Nominations),fill=Primary_genere))+geom_bar(stat = "identity")+coord_flip()+ theme(legend.position="none")
rank_plt_2<-rank_plt_2+xlab('')+ylab('')+ggtitle("Total Nominations by Genere")
ggpubr::ggarrange(rank_plt, rank_plt_2, nrow = 1 )
Take Aways: ## plot one Where to find the Best episodes ####
WhereToFindPlot<-ggplot(Best_eps,aes(WhereToWatch, fill=WhereToWatch))+geom_bar()+coord_flip()+xlab('')+ylab('')+ggtitle("Where to Find the Best Shows")+theme(legend.position="none")
### Best episodes fo all time ####
top_eps<-ggplot(Best_eps[Best_eps$rankings<=10,], aes(reorder(show, rankings), fill=show,rankings))+geom_bar(stat = 'identity')+coord_flip()+xlab('')+ylab('')+ggtitle("Top Rated Episodes of All Time")+theme(legend.position="none")+geom_text(aes(label=ep_name), size=3, hjust='right')
### putting Everything in one Grid ####
ggpubr::ggarrange(WhereToFindPlot, top_eps, nrow = 1 )
Take Aways: This portion of the analysis will implement sentiment analysis in order to understand what makes a great tv show. There will be three plots to understand these trends: plot one will be a word frequency count plot which will look at the total amount of the most used word in reviews (excluding stop words); plot 2 will be a word cloud plot which will allow us to see the most frequently used words; and plot four will utilize the AFFIN Lexicon in order to understand what emotions are linked to a memorable tv episode.
### putting all reviews into one massive string ####
review<-tibble(text=toString(Best_eps$review))
##### making a Data frame with one word per column
tokens<-review%>%
unnest_tokens(word,text)
### Romoving Stop Words ####
data("stop_words")
tokens<-tokens%>%
anti_join(stop_words, by='word')
#### word Frequencies and filtering out the top ###
top_word<-tokens%>%
count(word, sort=TRUE)%>%
filter(n>10)
### getting ride of the word episode ####
top_word<-top_word[top_word$word!='episode',]
### plotting the most used words ####
ggplot(top_word,aes(reorder(word,n),n, fill=word))+geom_bar(stat = 'identity')+coord_flip()+theme(legend.position="none")+ggtitle('Most Frequent Words in Reviews')+xlab('')
take aways:
library(wordcloud2)
top_word<-tokens%>%
count(word, sort=TRUE)%>%
filter(n>5)
wordcloud2(top_word, shape = 'star')
take aways:
sentiments<-get_sentiments('afinn')
word_sentiments<-top_word%>%
inner_join(sentiments,by = "word")
ggplot(word_sentiments,aes(word, value, fill=n))+geom_bar(stat='identity')+ggtitle("What Emotions are Ascociated with a Good Tv Episode")
take aways: