A Scrapy TV Show

by: Joshua Roberge

Objective: This Analysis will attempt to understand what it takes to have a successful TV show. To answer the purposed question, this analysis will use three main techniques: one, web scrapping to acquire the data; two, descriptive statistics to understand what factors influence the popularity of movies; and three, Sentiment Analysis to understand peoples emotional connections to the shows they watch.

Web Sites: Permissions Dependencies & lexicons Table of contents

Section 1: Web Scrapping


Section One Part One: Best Episodes of All Time Scrapes

This is a ranking of the best tv episodes of all time. The episodes from the website range from 1960 to the years 2019. The reviews from this website will be used to understand the makeup of a good TV show

#### Html getting information ###
best_100<-read_html('http://besttv.theringer.com/')

x=TRUE

rankings<-best_100 %>%
html_nodes('div.meta-bar')%>%
html_node('span.mobile')%>%
html_text()

show<-best_100 %>%
html_nodes('div.meta-bar')%>%
html_node('h4')%>%
html_text()

episode<-best_100 %>%
html_nodes('div.meta-bar')%>%
html_node('h4.ep-number')%>%
html_text()

ep_name<-best_100 %>%
html_nodes('div.show-content-wrapper')%>%
html_node('h2')%>%
html_text()

review<-best_100 %>%
html_nodes('div.blurb')%>%
html_text()


director<-best_100 %>%
html_nodes('div.director')%>%
html_node('span.value')%>%
html_text()

WhereToWatch<-best_100%>%
html_nodes('div.rank')%>%
html_node('div.where-to-watch ul li')%>%
html_text()


writer<-best_100%>%
html_nodes('div.rank')%>%
html_node('div.writer span.value')%>%
html_text()



ranks<-best_100%>%
html_nodes('div.rank')%>%
html_text()


director<-str_replace(str_extract(ranks, "Directed By [:alpha:]+ [:alpha:]+"),"Directed By", "")
Exc_produxer<-str_replace(str_extract(director, "Executive Producer ([:alpha:] ,)+ [:alpha:]"),"Directed By", "")

excutive_producer<-str_extract(ranks, "Executive Producer ([:alpha:]+[:blank:]{1}[:alpha:]+[:blank:]{1})+D")
excutive_producer<-str_replace(str_replace(excutive_producer, " D$", ""), "Executive Producer", "")
director

[1] " Michelle MacLaren" " Toby Haynes" " George Plamondon"
[4] " Philip Martin" " Craig Zisk" " David Solomon"
[7] " Adam Sandler" " Stephen Hopkins" " Hettie Macdonald"
[10] " Bruce Gowers" " Barry Sonnenfeld" NA
[13] " Andrew Barchilon" " Bret Haaland" " Andy Devonshire"
[16] NA " Bryan Gordon" " Brad Falchuk"
[19] " Melina Matsoukas" NA " Judd Apatow"
[22] " Patrick Lope" " Walt Dohrn" " Ryan Murphy"
[25] " Jody Hill" NA " Beth McCarthy"
[28] " Jeff Melman" " John Wells" " Greg Yaitanes"
[31] NA " Bryan Gordon" " David Lynch"
[34] " Alan Ball" " Louis C" " Brooke Kennedy"
[37] " Anthony Russo" " Michael Cuesta" " Kevin McCarthy"
[40] " Dee Rees" " Tony Sacco" " Adam Arkin"
[43] " Ken Fuchs" " Sam Esmail" " Charles McDougall" [46] NA " JC Gonzalez" " Jack Bender"
[49] " Pamela Fryman" NA " James Whitmore"
[52] " Michael Schur" " Lena Dunham" " Jeff Tremaine"
[55] " Daniel Sackheim" " Craig Zobel" " Juan Jose"
[58] " Mike Judge" NA " Michael Schur"
[61] " Steve Shill" " Chuck O" " Ted Kotcheff"
[64] " Davis Guggenheim" " Ben Bolt" " Trey Parker"
[67] " Jason Sands" " Don Roy" " Emma Napper"
[70] " Steven Zaillian" NA " Matt Shakman"
[73] " Michael Rymer" NA " Nathan Fielder"
[76] " Joss Whedon" " Gail Mancuso" NA
[79] " David Mandel" " Kenny Ortega" " Owen Harris"
[82] " Michael Dinner" " Peter Berg" NA
[85] " Paul Feig" " Donald Glover" " Brad Kreisberg"
[88] " Andrew Jarecki" " Thomas Schlamme" " Mark Tinker"
[91] " Rian Johnson" " Doug Liman" " Steve Buscemi"
[94] NA " Joe Chappelle" " Neal Brennan"
[97] " Cary Fukunaga" " David Nutter" " Jennifer Getzinger" [100] " Jack Bender"

### combining Everything ####
Best_eps<-data.frame(WhereToWatch,excutive_producer,ep_name,ranks,show,episode,rankings,director,review)
### saving dirty Version ####
#
# write.csv(Best_eps,'best_eps_dirty.csv')

Best_eps<-read.csv('best_eps_dirty.csv')
Messy Data Frame of top rated Episodes
WhereToWatch excutive_producer ep_name ranks show episode rankings director
Netflix NA “Pretty Much Dead Already” 100Character Death Directed By Michelle MacLaren Written By Scott Gimple Watch OnNetflix
  1. The Walking Dead
S2 E7 100 Michelle MacLaren
Netflix NA “The Reichenbach Fall” 99 Directed By Toby Haynes Written By Steve Thompson Watch OnNetflix
  1. Sherlock
S2 E3 99 Toby Haynes
Amazon Julie Auerbach “What Happens in Cabo” 98 Executive Producer Julie Auerbach Directed By George Plamondon Watch OnAmazon
  1. Laguna Beach
S1 E5 98 George Plamondon
Netflix NA “Smoke and Mirrors” 97 Directed By Philip Martin Written By Peter Morgan Watch OnNetflix
  1. The Crown
S1 E5 97 Philip Martin
Netflix NA “Go” 96Season Finale Directed By Craig Zisk Written By Jenji Kohan, Victoria Morrow, Ron Fitzgerald Watch OnNetflix
  1. Weeds
S3 E15 96 Craig Zisk

Clean up of Episode Data

Best_eps$WhereToWatch<-str_squish(Best_eps[['WhereToWatch']])
Best_eps<-Best_eps%>%select(-c(ranks))
Best_eps['season']<-as.numeric(str_remove_all(str_extract_all(Best_eps$episode, '^S\\d{1,2}'),'S'))


Best_eps$episode<-as.numeric(str_remove_all(str_extract_all(Best_eps$episode,'E\\d{1,3}'),'E'))

Best_eps$show<-str_remove_all(Best_eps$show,'\\d{1,3}\\. ')
Best_eps$rankings<-as.numeric(str_remove_all(Best_eps$rankings,'\\.'))

write.csv(Best_eps,'best_eps_cleanish.csv') 

kable(head(Best_eps[,-which(names(Best_eps)=="review")]), caption = 'Clean Episode Data')%>%kable_styling(bootstrap_options = c("striped", "hover"))
Clean Episode Data
X WhereToWatch excutive_producer ep_name show episode rankings director season
1 Netflix NA “Pretty Much Dead Already” The Walking Dead 7 100 Michelle MacLaren 2
2 Netflix NA “The Reichenbach Fall” Sherlock 3 99 Toby Haynes 2
3 Amazon Julie Auerbach “What Happens in Cabo” Laguna Beach 5 98 George Plamondon 1
4 Netflix NA “Smoke and Mirrors” The Crown 5 97 Philip Martin 1
5 Netflix NA “Go” Weeds 15 96 Craig Zisk 3
6 Hulu NA “Out of Gas” Firefly 5 95 David Solomon 1

Section Two Part One: Best Shows of All Time Scrapes

The website ratinggraph collects information about Tv shows, actors, directors and writers. This data will be used to acquire data for the use of descriptive statistics. These statistics will hopefully be able to answer: ‘Do nominations effect how popular a tv show is?’, ‘Are older TV shows more popular then newer ones?’ and ‘How does genres effect the ratting of a movie?’


Code for Web Scrapping Rating Graph

### initiating driver run only once!!!!!!!!!

driver <- rsDriver(browser = c("firefox"))
## checking Selenium Server versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking chromedriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking geckodriver versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD
## checking phantomjs versions:
## BEGIN: PREDOWNLOAD
## BEGIN: DOWNLOAD
## BEGIN: POSTDOWNLOAD

[1] “Connecting to remote server” $acceptInsecureCerts [1] FALSE

$browserName [1] “firefox”

$browserVersion [1] “69.0.3”

$moz:accessibilityChecks [1] FALSE

$moz:buildID [1] “20191009172106”

$moz:geckodriverVersion [1] “0.26.0”

$moz:headless [1] FALSE

$moz:processID [1] 9644

$moz:profile [1] “C:\Users\jwr17\AppData\Local\Temp\rust_mozprofileq1Ep03”

$moz:shutdownTimeout [1] 60000

$moz:useNonSpecCompliantPointerOrigin [1] FALSE

$moz:webdriverClick [1] TRUE

$pageLoadStrategy [1] “normal”

$platformName [1] “windows”

$platformVersion [1] “10.0”

$rotatable [1] FALSE

$setWindowRect [1] TRUE

$strictFileInteractability [1] FALSE

$timeouts \(timeouts\)implicit [1] 0

\(timeouts\)pageLoad [1] 300000

\(timeouts\)script [1] 30000

$unhandledPromptBehavior [1] “dismiss and notify”

$webdriver.remote.sessionid [1] “cecd3409-f762-4c6d-9924-e530a2ad6280”

$id [1] “cecd3409-f762-4c6d-9924-e530a2ad6280”

# Connect to the firefox driver and open it
remote_driver <- driver[["client"]]
remote_driver$open()

[1] “Connecting to remote server” $acceptInsecureCerts [1] FALSE

$browserName [1] “firefox”

$browserVersion [1] “69.0.3”

$moz:accessibilityChecks [1] FALSE

$moz:buildID [1] “20191009172106”

$moz:geckodriverVersion [1] “0.26.0”

$moz:headless [1] FALSE

$moz:processID [1] 6348

$moz:profile [1] “C:\Users\jwr17\AppData\Local\Temp\rust_mozprofileaTsoSA”

$moz:shutdownTimeout [1] 60000

$moz:useNonSpecCompliantPointerOrigin [1] FALSE

$moz:webdriverClick [1] TRUE

$pageLoadStrategy [1] “normal”

$platformName [1] “windows”

$platformVersion [1] “10.0”

$rotatable [1] FALSE

$setWindowRect [1] TRUE

$strictFileInteractability [1] FALSE

$timeouts \(timeouts\)implicit [1] 0

\(timeouts\)pageLoad [1] 300000

\(timeouts\)script [1] 30000

$unhandledPromptBehavior [1] “dismiss and notify”

$webdriver.remote.sessionid [1] “f3f9d1ab-d73e-475c-9d21-2b2209e708af”

$id [1] “f3f9d1ab-d73e-475c-9d21-2b2209e708af”

# # # Point the remote driver to the url of interest

 remote_driver$navigate("https://www.ratingraph.com/top_tv_shows/")
# #
# # ######## getting the list to 250 instead of 25 #######
Sys.sleep(5)
option <- remote_driver$findElement(using = 'xpath', '//*[@id="toplist_length"]/label/select/option[5]')
option$clickElement()
Sys.sleep(5)


# # ### creating a list of string number ie '1','2','3'
tabs<-sprintf("%d",seq(1:60))
# #
# # ### empty data frame for merging ###
master <- data.frame()

for (tab_num in tabs){
# #
# # ##### clicking through the tags #####
option <- remote_driver$findElement(using = 'link text', tab_num)
option$clickElement()

page <-remote_driver$getPageSource()
page1 <-read_html(page[[1]])
table<-page1%>%
html_node('table.display')%>%
html_table()
master<-rbind(master,table)
Sys.sleep(sample(1:5, 1, replace=T))

}

driver$client$close()
# master
# write.csv(master,'master.csv')
master<-read.csv('master.csv')
kable(head(master), caption = 'Dirty Data: Best Movies of all Time')%>%kable_styling(bootstrap_options = c("striped", "hover"))
Dirty Data: Best Movies of all Time
X Rank Trend Title Start.year End.year Genres Language Runtime..Min.. Seasons Episodes Wins Nominations Votes..SUM. Rating..AVG.
1 1 31 Game of Thrones 2011 2019 Action, Adventure, Drama English 57 8 73 313 505 3,413,751 8.8
2 2 86 Breaking Bad 2008 2013 Crime, Drama, Thriller English 49 5 62 138 217 1,095,244 9.0
3 3 86 The Walking Dead 2010
Drama, Horror, Thriller English 44 10 134 69 195 1,202,933 8.1
4 4 100 Supernatural 2005
Drama, Fantasy, Horror English 44 15 309 22 85 861,000 8.5
5 5 77 Friends 1994 2004 Comedy, Romance English 22 10 236 71 211 731,459 8.5
6 6 38 Black Mirror 2011
Drama, Sci-fi, Thriller English 60 5 22 23 75 785,611 7.9

Cleaing up best shows

## cleaning up master ####
## changing to numeric class
master<-read.csv('master.csv')

master$Rank<-as.numeric(str_replace_all(master$Rank,',',''))

#### trimming white spaces ####
master$Title<-str_trim(master$Title, side='both')



#### cleaning end columns by specific criteria (some 'na are only there bc the show is still in ')
 end_year_piece_one<-as.numeric(str_replace_all(master[1:1500,c('End.year')], '-', '2019'))
 i=1501
 for (endyear in master[c(1501:14780), c('End.year')]){
     if (endyear != '-'){
         end_year_piece_one<-c(end_year_piece_one, endyear)
     }else if (master$Start.year[i]>= 2016){
         end_year_piece_one<-c(end_year_piece_one, 2019)
     }else{
         end_year_piece_one<-c(end_year_piece_one, NA)


}
     i<-i+1

 }

master$End.year<-as.numeric(end_year_piece_one)

##### Numeric fitting  Runtime..####
master$Runtime..Min..<-as.numeric(str_remove_all(master$Runtime..Min.., '-'))


#### master wins #####
master$Wins<-as.numeric(str_replace_all(master$Wins, '-','0'))

master$Nominations<-as.numeric(str_replace_all(master$Nominations, '-','0'))

master$Trend<-as.numeric(master$Trend)
master$Episodes<-as.numeric(master$Episodes)

master$Votes..SUM.<-as.numeric(str_remove_all(str_replace_all(master$Votes..SUM., '-','0'), ','))
write.csv(master,"clean_best_shows_R.csv")
master<-read.csv('clean_best_shows_R.csv')
master<-select(master,-c(X.1,X))
kable(head(master), caption = 'Clean Data: Best Movies of all Time')%>%kable_styling(bootstrap_options = c("striped", "hover"))
Clean Data: Best Movies of all Time
Rank Trend Title Start.year End.year Genres Language Runtime..Min.. Seasons Episodes Wins Nominations Votes..SUM. Rating..AVG.
1 24 Game of Thrones 2011 2019 Action, Adventure, Drama English 57 8 328 313 505 3413751 8.8
2 45 Breaking Bad 2008 2013 Crime, Drama, Thriller English 49 5 308 138 217 1095244 9.0
3 45 The Walking Dead 2010 2019 Drama, Horror, Thriller English 44 10 47 69 195 1202933 8.1
4 4 Supernatural 2005 2019 Drama, Fantasy, Horror English 44 15 210 22 85 861000 8.5
5 43 Friends 1994 2004 Comedy, Romance English 22 10 155 71 211 731459 8.5
6 30 Black Mirror 2011 2019 Drama, Sci-fi, Thriller English 60 5 138 23 75 785611 7.9

Section 2: Descriptive Analysis


Section Two Part One: Nominations Vs. Rating Average

This is a ranking of the best tv episodes of all time. The best episodes data set ranges from 1950 to 2019. The reviews from this website will be used to understand the makeup of a good TV show.

library(ggplot2)
### making a coloumn which calculates how long a tv show has been arround
master$longevity<-master$End.year-master$Start.year

#### Getting Ride of Noisy and less popular movies #### 
clean<-master[(master$Rating..AVG.>=0.0 & master$Votes..SUM.>2000 & master$Rank<7000),]
### plotting 
plot<-ggplot(clean,aes(y=Nominations, size=-Rank, col=Start.year, x=Rating..AVG.))+geom_point(position='jitter')
plot+labs(size = "Ranks", caption='1')+ggtitle("Nominations Vs. Rating Average")+xlab('Rating Average')
Take aways:

Section two Part two:How does Genere Effect Movies?


master['Primary_genere']=str_extract(master$Genres,"[:alpha:]+")

#### breaking things down by Genere #####
by_genere<-master%>%
  select(Primary_genere,Rank, Rating..AVG., Votes..SUM., Nominations)%>%
  group_by(Primary_genere)%>%
  summarise(Nominations=sum(Nominations),
            Rank=mean(Rank),
            Rating..AVG.=mean(Rating..AVG.),
            Votes..SUM.=mean(Votes..SUM.),
            Nominations=mean(Nominations))


### plot one: Average Ranks by Genere ####
rank_plt<-ggplot(by_genere, aes(y=Rank, x=reorder(Primary_genere,Rank),fill=Primary_genere))+geom_bar(stat = "identity")+coord_flip()+ theme(legend.position="none")
rank_plt<-rank_plt+xlab('')+ylab('')+ggtitle("Average Rank by Genere")

### plot Two: Total Nominations by Genere###
rank_plt_2<-ggplot(by_genere, aes(y=Nominations, x=reorder(Primary_genere,Nominations),fill=Primary_genere))+geom_bar(stat = "identity")+coord_flip()+ theme(legend.position="none")
rank_plt_2<-rank_plt_2+xlab('')+ylab('')+ggtitle("Total Nominations by Genere")


ggpubr::ggarrange(rank_plt, rank_plt_2, nrow  = 1 )
Take Aways:

Section Two Part Three: Best Episodes and Where to Find them


## plot one Where to find the Best episodes ####
WhereToFindPlot<-ggplot(Best_eps,aes(WhereToWatch, fill=WhereToWatch))+geom_bar()+coord_flip()+xlab('')+ylab('')+ggtitle("Where to Find the Best Shows")+theme(legend.position="none")

### Best episodes fo all time ####
top_eps<-ggplot(Best_eps[Best_eps$rankings<=10,], aes(reorder(show, rankings), fill=show,rankings))+geom_bar(stat = 'identity')+coord_flip()+xlab('')+ylab('')+ggtitle("Top Rated Episodes of All Time")+theme(legend.position="none")+geom_text(aes(label=ep_name), size=3, hjust='right')

### putting Everything in one Grid ####                                                                                                          
ggpubr::ggarrange(WhereToFindPlot, top_eps, nrow  = 1 )
Take Aways:

Section 3: Sentiment Analysis


This portion of the analysis will implement sentiment analysis in order to understand what makes a great tv show. There will be three plots to understand these trends: plot one will be a word frequency count plot which will look at the total amount of the most used word in reviews (excluding stop words); plot 2 will be a word cloud plot which will allow us to see the most frequently used words; and plot four will utilize the AFFIN Lexicon in order to understand what emotions are linked to a memorable tv episode.



<>

Section Three Part One: Most Used Words in best Episode Review


### putting all reviews into one massive string ####
review<-tibble(text=toString(Best_eps$review))


##### making a Data frame with one word per column
tokens<-review%>%
  unnest_tokens(word,text)

  
### Romoving Stop Words ####
data("stop_words")

tokens<-tokens%>%
  anti_join(stop_words, by='word')

#### word Frequencies and filtering out the top ###
top_word<-tokens%>%
  count(word, sort=TRUE)%>%
  filter(n>10)

### getting ride of the word episode ####
top_word<-top_word[top_word$word!='episode',]

### plotting the most used words ####
ggplot(top_word,aes(reorder(word,n),n, fill=word))+geom_bar(stat = 'identity')+coord_flip()+theme(legend.position="none")+ggtitle('Most Frequent Words in Reviews')+xlab('')

take aways:


Section Three Part Two:Word Cloud


library(wordcloud2)
top_word<-tokens%>%
  count(word, sort=TRUE)%>%
  filter(n>5)
wordcloud2(top_word, shape =  'star')
take aways:

Section Three Part Three: Sentiment


sentiments<-get_sentiments('afinn')

word_sentiments<-top_word%>%
  inner_join(sentiments,by = "word")

ggplot(word_sentiments,aes(word, value, fill=n))+geom_bar(stat='identity')+ggtitle("What Emotions are Ascociated with a Good Tv Episode")

take aways: