web_scrap_rpoject_r_markdown

A Scrapy TV Show
by: Joshua Roberge

Objective: This Analysis will attempt to understand what it takes to have a successful TV show. To answer the purposed question, this analysis will use three main techniques: one, web scrapping to acquire the data; two, descriptive statistics to understand what factors influence the popularity of movies; and three, Sentiment Analysis to understand peoples emotional connections to the shows they watch.

Web Sites:

Permissions

https://www.theringer.com/robots.txt
Ratinggraph: no user agreement or robot.txt

Dependencies & lexicons

stringr
rvest
tidyr
tidyverse
RSelenium
dplyr
ggplot2
tidytext
Leaxicon: AFFIN

Table of contents

Section One Web Scrapping

Section Two: Descriptive Statistics

Section Three: Sentiment Analysis

Section 1: Web Scrapping

Section One Part One: Best Episodes of All Time Scrapes

This is a ranking of the best tv episodes of all time. The episodes from the website range from 1960 to the years 2019. The reviews from this website will be used to understand the makeup of a good TV show

#### Html getting information ###
best_100<-read_html('http://besttv.theringer.com/')

x=TRUE

rankings<-best_100 %>%
html_nodes('div.meta-bar')%>%
html_node('span.mobile')%>%
html_text()

show<-best_100 %>%
html_nodes('div.meta-bar')%>%
html_node('h4')%>%
html_text()

episode<-best_100 %>%
html_nodes('div.meta-bar')%>%
html_node('h4.ep-number')%>%
html_text()

ep_name<-best_100 %>%
html_nodes('div.show-content-wrapper')%>%
html_node('h2')%>%
html_text()

review<-best_100 %>%
html_nodes('div.blurb')%>%
html_text()


director<-best_100 %>%
html_nodes('div.director')%>%
html_node('span.value')%>%
html_text()

WhereToWatch<-best_100%>%
html_nodes('div.rank')%>%
html_node('div.where-to-watch ul li')%>%
html_text()


writer<-best_100%>%
html_nodes('div.rank')%>%
html_node('div.writer span.value')%>%
html_text()



ranks<-best_100%>%
html_nodes('div.rank')%>%
html_text()


director<-str_replace(str_extract(ranks, "Directed By [:alpha:]+ [:alpha:]+"),"Directed By", "")
Exc_produxer<-str_replace(str_extract(director, "Executive Producer ([:alpha:] ,)+ [:alpha:]"),"Directed By", "")

excutive_producer<-str_extract(ranks, "Executive Producer ([:alpha:]+[:blank:]{1}[:alpha:]+[:blank:]{1})+D")
excutive_producer<-str_replace(str_replace(excutive_producer, " D$", ""), "Executive Producer", "")
director

[1] " Michelle MacLaren" " Toby Haynes" " George Plamondon"
[4] " Philip Martin" " Craig Zisk" " David Solomon"
[7] " Adam Sandler" " Stephen Hopkins" " Hettie Macdonald"
[10] " Bruce Gowers" " Barry Sonnenfeld" NA
[13] " Andrew Barchilon" " Bret Haaland" " Andy Devonshire"
[16] NA " Bryan Gordon" " Brad Falchuk"
[19] " Melina Matsoukas" NA " Judd Apatow"
[22] " Patrick Lope" " Walt Dohrn" " Ryan Murphy"
[25] " Jody Hill" NA " Beth McCarthy"
[28] " Jeff Melman" " John Wells" " Greg Yaitanes"
[31] NA " Bryan Gordon" " David Lynch"
[34] " Alan Ball" " Louis C" " Brooke Kennedy"
[37] " Anthony Russo" " Michael Cuesta" " Kevin McCarthy"
[40] " Dee Rees" " Tony Sacco" " Adam Arkin"
[43] " Ken Fuchs" " Sam Esmail" " Charles McDougall" [46] NA " JC Gonzalez" " Jack Bender"
[49] " Pamela Fryman" NA " James Whitmore"
[52] " Michael Schur" " Lena Dunham" " Jeff Tremaine"
[55] " Daniel Sackheim" " Craig Zobel" " Juan Jose"
[58] " Mike Judge" NA " Michael Schur"
[61] " Steve Shill" " Chuck O" " Ted Kotcheff"
[64] " Davis Guggenheim" " Ben Bolt" " Trey Parker"
[67] " Jason Sands" " Don Roy" " Emma Napper"
[70] " Steven Zaillian" NA " Matt Shakman"
[73] " Michael Rymer" NA " Nathan Fielder"
[76] " Joss Whedon" " Gail Mancuso" NA
[79] " David Mandel" " Kenny Ortega" " Owen Harris"
[82] " Michael Dinner" " Peter Berg" NA
[85] " Paul Feig" " Donald Glover" " Brad Kreisberg"
[88] " Andrew Jarecki" " Thomas Schlamme" " Mark Tinker"
[91] " Rian Johnson" " Doug Liman" " Steve Buscemi"
[94] NA " Joe Chappelle" " Neal Brennan"
[97] " Cary Fukunaga" " David Nutter" " Jennifer Getzinger" [100] " Jack Bender"

### combining Everything ####
Best_eps<-data.frame(WhereToWatch,excutive_producer,ep_name,ranks,show,episode,rankings,director,review)
### saving dirty Version ####
#
# write.csv(Best_eps,'best_eps_dirty.csv')

Best_eps<-read.csv('best_eps_dirty.csv')

Messy Data Frame of top rated Episodes
WhereToWatch	excutive_producer	ep_name	ranks	show	episode	rankings	director
Netflix	NA	“Pretty Much Dead Already”	100Character Death Directed By Michelle MacLaren Written By Scott Gimple Watch OnNetflix	The Walking Dead	S2 E7	100	Michelle MacLaren
Netflix	NA	“The Reichenbach Fall”	99 Directed By Toby Haynes Written By Steve Thompson Watch OnNetflix	Sherlock	S2 E3	99	Toby Haynes
Amazon	Julie Auerbach	“What Happens in Cabo”	98 Executive Producer Julie Auerbach Directed By George Plamondon Watch OnAmazon	Laguna Beach	S1 E5	98	George Plamondon
Netflix	NA	“Smoke and Mirrors”	97 Directed By Philip Martin Written By Peter Morgan Watch OnNetflix	The Crown	S1 E5	97	Philip Martin
Netflix	NA	“Go”	96Season Finale Directed By Craig Zisk Written By Jenji Kohan, Victoria Morrow, Ron Fitzgerald Watch OnNetflix	Weeds	S3 E15	96	Craig Zisk

Clean up of Episode Data

Best_eps$WhereToWatch<-str_squish(Best_eps[['WhereToWatch']])
Best_eps<-Best_eps%>%select(-c(ranks))
Best_eps['season']<-as.numeric(str_remove_all(str_extract_all(Best_eps$episode, '^S\\d{1,2}'),'S'))


Best_eps$episode<-as.numeric(str_remove_all(str_extract_all(Best_eps$episode,'E\\d{1,3}'),'E'))

Best_eps$show<-str_remove_all(Best_eps$show,'\\d{1,3}\\. ')
Best_eps$rankings<-as.numeric(str_remove_all(Best_eps$rankings,'\\.'))

write.csv(Best_eps,'best_eps_cleanish.csv') 

kable(head(Best_eps[,-which(names(Best_eps)=="review")]), caption = 'Clean Episode Data')%>%kable_styling(bootstrap_options = c("striped", "hover"))

Clean Episode Data
X	WhereToWatch	excutive_producer	ep_name	show	episode	rankings	director	season
1	Netflix	NA	“Pretty Much Dead Already”	The Walking Dead	7	100	Michelle MacLaren	2
2	Netflix	NA	“The Reichenbach Fall”	Sherlock	3	99	Toby Haynes	2
3	Amazon	Julie Auerbach	“What Happens in Cabo”	Laguna Beach	5	98	George Plamondon	1
4	Netflix	NA	“Smoke and Mirrors”	The Crown	5	97	Philip Martin	1
5	Netflix	NA	“Go”	Weeds	15	96	Craig Zisk	3
6	Hulu	NA	“Out of Gas”	Firefly	5	95	David Solomon	1

Section Two Part One: Best Shows of All Time Scrapes

The website ratinggraph collects information about Tv shows, actors, directors and writers. This data will be used to acquire data for the use of descriptive statistics. These statistics will hopefully be able to answer: ‘Do nominations effect how popular a tv show is?’, ‘Are older TV shows more popular then newer ones?’ and ‘How does genres effect the ratting of a movie?’

Code for Web Scrapping Rating Graph

### initiating driver run only once!!!!!!!!!

driver <- rsDriver(browser = c("firefox"))

## checking Selenium Server versions:

## BEGIN: PREDOWNLOAD

## BEGIN: DOWNLOAD

## BEGIN: POSTDOWNLOAD

## checking chromedriver versions:

## BEGIN: PREDOWNLOAD

## BEGIN: DOWNLOAD

## BEGIN: POSTDOWNLOAD

## checking geckodriver versions:

## BEGIN: PREDOWNLOAD

## BEGIN: DOWNLOAD

## BEGIN: POSTDOWNLOAD

## checking phantomjs versions:

## BEGIN: PREDOWNLOAD

## BEGIN: DOWNLOAD

## BEGIN: POSTDOWNLOAD

[1] “Connecting to remote server” $acceptInsecureCerts [1] FALSE

$browserName [1] “firefox”

$browserVersion [1] “69.0.3”

$moz:accessibilityChecks [1] FALSE

$moz:buildID [1] “20191009172106”

$moz:geckodriverVersion [1] “0.26.0”

$moz:headless [1] FALSE

$moz:processID [1] 9644

$moz:profile [1] “C:\Users\jwr17\AppData\Local\Temp\rust_mozprofileq1Ep03”

$moz:shutdownTimeout [1] 60000

$moz:useNonSpecCompliantPointerOrigin [1] FALSE

$moz:webdriverClick [1] TRUE

$pageLoadStrategy [1] “normal”

$platformName [1] “windows”

$platformVersion [1] “10.0”

$rotatable [1] FALSE

$setWindowRect [1] TRUE

$strictFileInteractability [1] FALSE

$timeouts $timeouts$implicit [1] 0

$timeouts$pageLoad [1] 300000

$timeouts$script [1] 30000

$unhandledPromptBehavior [1] “dismiss and notify”

$webdriver.remote.sessionid [1] “cecd3409-f762-4c6d-9924-e530a2ad6280”

$id [1] “cecd3409-f762-4c6d-9924-e530a2ad6280”

# Connect to the firefox driver and open it
remote_driver <- driver[["client"]]
remote_driver$open()

[1] “Connecting to remote server” $acceptInsecureCerts [1] FALSE

$browserName [1] “firefox”

$browserVersion [1] “69.0.3”

$moz:accessibilityChecks [1] FALSE

$moz:buildID [1] “20191009172106”

$moz:geckodriverVersion [1] “0.26.0”

$moz:headless [1] FALSE

$moz:processID [1] 6348

$moz:profile [1] “C:\Users\jwr17\AppData\Local\Temp\rust_mozprofileaTsoSA”

$moz:shutdownTimeout [1] 60000

$moz:useNonSpecCompliantPointerOrigin [1] FALSE

$moz:webdriverClick [1] TRUE

$pageLoadStrategy [1] “normal”

$platformName [1] “windows”

$platformVersion [1] “10.0”

$rotatable [1] FALSE

$setWindowRect [1] TRUE

$strictFileInteractability [1] FALSE

$timeouts $timeouts$implicit [1] 0

$timeouts$pageLoad [1] 300000

$timeouts$script [1] 30000

$unhandledPromptBehavior [1] “dismiss and notify”

$webdriver.remote.sessionid [1] “f3f9d1ab-d73e-475c-9d21-2b2209e708af”

$id [1] “f3f9d1ab-d73e-475c-9d21-2b2209e708af”

# # # Point the remote driver to the url of interest

 remote_driver$navigate("https://www.ratingraph.com/top_tv_shows/")
# #
# # ######## getting the list to 250 instead of 25 #######
Sys.sleep(5)
option <- remote_driver$findElement(using = 'xpath', '//*[@id="toplist_length"]/label/select/option[5]')
option$clickElement()
Sys.sleep(5)


# # ### creating a list of string number ie '1','2','3'
tabs<-sprintf("%d",seq(1:60))
# #
# # ### empty data frame for merging ###
master <- data.frame()

for (tab_num in tabs){
# #
# # ##### clicking through the tags #####
option <- remote_driver$findElement(using = 'link text', tab_num)
option$clickElement()

page <-remote_driver$getPageSource()
page1 <-read_html(page[[1]])
table<-page1%>%
html_node('table.display')%>%
html_table()
master<-rbind(master,table)
Sys.sleep(sample(1:5, 1, replace=T))

}

driver$client$close()
# master
# write.csv(master,'master.csv')
master<-read.csv('master.csv')
kable(head(master), caption = 'Dirty Data: Best Movies of all Time')%>%kable_styling(bootstrap_options = c("striped", "hover"))

Dirty Data: Best Movies of all Time
X	Rank	Trend	Title	Start.year	End.year	Genres	Language	Runtime..Min..	Seasons	Episodes	Wins	Nominations	Votes..SUM.	Rating..AVG.
1	1	31	Game of Thrones	2011	2019	Action, Adventure, Drama	English	57	8	73	313	505	3,413,751	8.8
2	2	86	Breaking Bad	2008	2013	Crime, Drama, Thriller	English	49	5	62	138	217	1,095,244	9.0
3	3	86	The Walking Dead	2010		Drama, Horror, Thriller	English	44	10	134	69	195	1,202,933	8.1
4	4	100	Supernatural	2005		Drama, Fantasy, Horror	English	44	15	309	22	85	861,000	8.5
5	5	77	Friends	1994	2004	Comedy, Romance	English	22	10	236	71	211	731,459	8.5
6	6	38	Black Mirror	2011		Drama, Sci-fi, Thriller	English	60	5	22	23	75	785,611	7.9

Cleaing up best shows

## cleaning up master ####
## changing to numeric class
master<-read.csv('master.csv')

master$Rank<-as.numeric(str_replace_all(master$Rank,',',''))

#### trimming white spaces ####
master$Title<-str_trim(master$Title, side='both')



#### cleaning end columns by specific criteria (some 'na are only there bc the show is still in ')
 end_year_piece_one<-as.numeric(str_replace_all(master[1:1500,c('End.year')], '-', '2019'))
 i=1501
 for (endyear in master[c(1501:14780), c('End.year')]){
     if (endyear != '-'){
         end_year_piece_one<-c(end_year_piece_one, endyear)
     }else if (master$Start.year[i]>= 2016){
         end_year_piece_one<-c(end_year_piece_one, 2019)
     }else{
         end_year_piece_one<-c(end_year_piece_one, NA)


}
     i<-i+1

 }

master$End.year<-as.numeric(end_year_piece_one)

##### Numeric fitting  Runtime..####
master$Runtime..Min..<-as.numeric(str_remove_all(master$Runtime..Min.., '-'))


#### master wins #####
master$Wins<-as.numeric(str_replace_all(master$Wins, '-','0'))

master$Nominations<-as.numeric(str_replace_all(master$Nominations, '-','0'))

master$Trend<-as.numeric(master$Trend)
master$Episodes<-as.numeric(master$Episodes)

master$Votes..SUM.<-as.numeric(str_remove_all(str_replace_all(master$Votes..SUM., '-','0'), ','))
write.csv(master,"clean_best_shows_R.csv")
master<-read.csv('clean_best_shows_R.csv')
master<-select(master,-c(X.1,X))
kable(head(master), caption = 'Clean Data: Best Movies of all Time')%>%kable_styling(bootstrap_options = c("striped", "hover"))

Clean Data: Best Movies of all Time
Rank	Trend	Title	Start.year	End.year	Genres	Language	Runtime..Min..	Seasons	Episodes	Wins	Nominations	Votes..SUM.	Rating..AVG.
1	24	Game of Thrones	2011	2019	Action, Adventure, Drama	English	57	8	328	313	505	3413751	8.8
2	45	Breaking Bad	2008	2013	Crime, Drama, Thriller	English	49	5	308	138	217	1095244	9.0
3	45	The Walking Dead	2010	2019	Drama, Horror, Thriller	English	44	10	47	69	195	1202933	8.1
4	4	Supernatural	2005	2019	Drama, Fantasy, Horror	English	44	15	210	22	85	861000	8.5
5	43	Friends	1994	2004	Comedy, Romance	English	22	10	155	71	211	731459	8.5
6	30	Black Mirror	2011	2019	Drama, Sci-fi, Thriller	English	60	5	138	23	75	785611	7.9

Section 2: Descriptive Analysis

Section Two Part One: Nominations Vs. Rating Average

This is a ranking of the best tv episodes of all time. The best episodes data set ranges from 1950 to 2019. The reviews from this website will be used to understand the makeup of a good TV show.

library(ggplot2)
### making a coloumn which calculates how long a tv show has been arround
master$longevity<-master$End.year-master$Start.year

#### Getting Ride of Noisy and less popular movies #### 
clean<-master[(master$Rating..AVG.>=0.0 & master$Votes..SUM.>2000 & master$Rank<7000),]
### plotting 
plot<-ggplot(clean,aes(y=Nominations, size=-Rank, col=Start.year, x=Rating..AVG.))+geom_point(position='jitter')
plot+labs(size = "Ranks", caption='1')+ggtitle("Nominations Vs. Rating Average")+xlab('Rating Average')

Take aways:

Looking at the graph it appears that New movies are more favorable
It is hard to say how much of an impact nominations play in Avg Ratting: more analysis is needed

Section two Part two:How does Genere Effect Movies?

master['Primary_genere']=str_extract(master$Genres,"[:alpha:]+")

#### breaking things down by Genere #####
by_genere<-master%>%
  select(Primary_genere,Rank, Rating..AVG., Votes..SUM., Nominations)%>%
  group_by(Primary_genere)%>%
  summarise(Nominations=sum(Nominations),
            Rank=mean(Rank),
            Rating..AVG.=mean(Rating..AVG.),
            Votes..SUM.=mean(Votes..SUM.),
            Nominations=mean(Nominations))


### plot one: Average Ranks by Genere ####
rank_plt<-ggplot(by_genere, aes(y=Rank, x=reorder(Primary_genere,Rank),fill=Primary_genere))+geom_bar(stat = "identity")+coord_flip()+ theme(legend.position="none")
rank_plt<-rank_plt+xlab('')+ylab('')+ggtitle("Average Rank by Genere")

### plot Two: Total Nominations by Genere###
rank_plt_2<-ggplot(by_genere, aes(y=Nominations, x=reorder(Primary_genere,Nominations),fill=Primary_genere))+geom_bar(stat = "identity")+coord_flip()+ theme(legend.position="none")
rank_plt_2<-rank_plt_2+xlab('')+ylab('')+ggtitle("Total Nominations by Genere")


ggpubr::ggarrange(rank_plt, rank_plt_2, nrow  = 1 )

Take Aways:

Action TV shows seem to do the best when it comes to Average Rank
History seems to do the worst when it comes to average Rank
Surprisingly, the comedy generes seems to do the best when it comes to nominations, perhaps this has to do with sitcoms?

Section Two Part Three: Best Episodes and Where to Find them

## plot one Where to find the Best episodes ####
WhereToFindPlot<-ggplot(Best_eps,aes(WhereToWatch, fill=WhereToWatch))+geom_bar()+coord_flip()+xlab('')+ylab('')+ggtitle("Where to Find the Best Shows")+theme(legend.position="none")

### Best episodes fo all time ####
top_eps<-ggplot(Best_eps[Best_eps$rankings<=10,], aes(reorder(show, rankings), fill=show,rankings))+geom_bar(stat = 'identity')+coord_flip()+xlab('')+ylab('')+ggtitle("Top Rated Episodes of All Time")+theme(legend.position="none")+geom_text(aes(label=ep_name), size=3, hjust='right')

### putting Everything in one Grid ####                                                                                                          
ggpubr::ggarrange(WhereToFindPlot, top_eps, nrow  = 1 )

Take Aways:

Hulu has the most best episodes of all time
Lost, great tv show, but i am surprissed it had the highest rated episode

Section 3: Sentiment Analysis

This portion of the analysis will implement sentiment analysis in order to understand what makes a great tv show. There will be three plots to understand these trends: plot one will be a word frequency count plot which will look at the total amount of the most used word in reviews (excluding stop words); plot 2 will be a word cloud plot which will allow us to see the most frequently used words; and plot four will utilize the AFFIN Lexicon in order to understand what emotions are linked to a memorable tv episode.

Section Three Part One: Most Used Words in best Episode Review

### putting all reviews into one massive string ####
review<-tibble(text=toString(Best_eps$review))


##### making a Data frame with one word per column
tokens<-review%>%
  unnest_tokens(word,text)

  
### Romoving Stop Words ####
data("stop_words")

tokens<-tokens%>%
  anti_join(stop_words, by='word')

#### word Frequencies and filtering out the top ###
top_word<-tokens%>%
  count(word, sort=TRUE)%>%
  filter(n>10)

### getting ride of the word episode ####
top_word<-top_word[top_word$word!='episode',]

### plotting the most used words ####
ggplot(top_word,aes(reorder(word,n),n, fill=word))+geom_bar(stat = 'identity')+coord_flip()+theme(legend.position="none")+ggtitle('Most Frequent Words in Reviews')+xlab('')

take aways:

Unsurprisingly, more stop words need to be taken out
The most words used seem to revolve around remembering the how iconic/memorable these tv shows were

Section Three Part Two:Word Cloud

library(wordcloud2)
top_word<-tokens%>%
  count(word, sort=TRUE)%>%
  filter(n>5)
wordcloud2(top_word, shape =  'star')

take aways:

Cool interactive visual
Most words revolve around the iconic nature of the television show

Section Three Part Three: Sentiment

sentiments<-get_sentiments('afinn')

word_sentiments<-top_word%>%
  inner_join(sentiments,by = "word")

ggplot(word_sentiments,aes(word, value, fill=n))+geom_bar(stat='identity')+ggtitle("What Emotions are Ascociated with a Good Tv Episode")

take aways:

Most Reviews correlate with negative emotions when it comes to a good show
This reveals that people may have an inclination towards the dramatic