Master Text Dataframe Creator:

Creates a dataframe with five columns:

Column One: The paper ID
Column Two: The title of the paper
Column Three: The abstract of the paper
Column Four: The main text of the paper
Column Five: The folder the paper was read from
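
As a rough illustration of how one parsed CORD-19 JSON file maps onto these columns, the sketch below flattens a single document into a row. The helper name `paper_to_row`, the output key names, and the sample dictionary are hypothetical; only the input keys `paper_id`, `metadata.title`, `abstract`, and `body_text` come from the notebook's own loop. Treating a missing abstract as empty is an assumption of the sketch, not something the notebook does.

```python
def paper_to_row(data):
    """Flatten one parsed CORD-19 JSON document into a flat dict (one row)."""
    return {
        "paper_id": data["paper_id"],
        "title": data["metadata"]["title"],
        # 'abstract' and 'body_text' are lists of paragraph dicts with a 'text' key.
        # Assumption: a missing abstract is treated as an empty string.
        "abstract": " ".join(p["text"] for p in data.get("abstract", [])),
        "main_text": " ".join(p["text"] for p in data["body_text"]),
    }


# Hypothetical, hand-made document used only to show the shape of the output
sample = {
    "paper_id": "abc123",
    "metadata": {"title": "An example title"},
    "abstract": [{"text": "First."}, {"text": "Second."}],
    "body_text": [{"text": "Body paragraph."}],
}
row = paper_to_row(sample)
print(row)
```

Using `" ".join(...)` avoids the leading space that repeated string concatenation leaves at the start of each text field.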

#### dependencies ####
import pandas as pd
import numpy as np
import os
import json
import datetime
import re  # needed by re.findall in the reading loop below

#### Navigate to your CORD-19 download folder #####
%cd "C:\Users\jwr17\Desktop\CORD-19-research-challenge"

C:\Users\jwr17\Desktop\CORD-19-research-challenge

#### understanding the file system ######
file_paths = ["biorxiv_medrxiv\\biorxiv_medrxiv", "comm_use_subset\\comm_use_subset", "custom_license\\custom_license", "noncomm_use_subset\\noncomm_use_subset"]
i = 0
for path in file_paths:
    n_files = len(os.listdir(path))  # list each folder only once
    print(f"file folder {path} has a total of {n_files} files")
    i = i + n_files
print(f"\n total files are equal to {i}")

file folder biorxiv_medrxiv\biorxiv_medrxiv has a total of 885 files
file folder comm_use_subset\comm_use_subset has a total of 9118 files
file folder custom_license\custom_license has a total of 16959 files
file folder noncomm_use_subset\noncomm_use_subset has a total of 2353 files

 total files are equal to 29315
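
The folder names above hardcode Windows backslashes. A minimal sketch of the same count using `pathlib`, which joins path parts correctly on any operating system, might look like the following. The function name `count_files` and the temporary demo files are hypothetical; unlike `os.listdir`, this version counts regular files only and ignores subdirectories.

```python
import tempfile
from pathlib import Path


def count_files(root, folders):
    """Return {folder: number of files} for each nested <name>/<name> folder."""
    counts = {}
    for folder in folders:
        # CORD-19 nests each subset as <name>/<name>
        subset_dir = Path(root) / folder / folder
        counts[folder] = sum(1 for p in subset_dir.iterdir() if p.is_file())
    return counts


# Demonstration on a throwaway directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    subset = Path(tmp) / "biorxiv_medrxiv" / "biorxiv_medrxiv"
    subset.mkdir(parents=True)
    for name in ("a.json", "b.json"):
        (subset / name).write_text("{}")
    demo_counts = count_files(tmp, ["biorxiv_medrxiv"])

print(demo_counts)
print("total:", sum(demo_counts.values()))
```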

biorxiv_medrxiv, commercial use, custom license, and non-commercial use

########################################################### reading in the file Loop #################################################################################
master_text=[]
i=0
count = 0
file_paths = ["biorxiv_medrxiv\\biorxiv_medrxiv", "comm_use_subset\\comm_use_subset", "custom_license\\custom_license", "noncomm_use_subset\\noncomm_use_subset"]


for path in file_paths:
    for file in os.listdir(path):  ### cycles through every json file from a given file path
        try:
            with open(os.path.join(path, file), "r") as read_file:  ### reading in json file
                data = json.load(read_file)  # the with block closes the file automatically

            paper_id = data['paper_id']
            papepr_title = data['metadata']['title']  # spelling kept to match the saved column name

            ##### getting abstract #######
            abs_text = ""
            for main in data['abstract']:
                abs_text = abs_text + ' ' + main['text']

            ####  getting body text ####
            main_text = ""
            for main in data['body_text']:
                main_text = main_text + ' ' + main['text']

            ### adding to list ###
            # NOTE: [:-2] drops the trailing backslash plus one more character, so the
            # stored folder name is truncated (e.g. "biorxiv_medrxi"); [:-1] would keep it whole
            master_text.append({'paper_id': paper_id, 'papepr_title': papepr_title, 'abstract': abs_text, 'main_text': main_text, "file_path": re.findall(r".+\\", path)[0][:-2]})
        except Exception:  # skip files that cannot be parsed or lack an expected key
            i += 1
            print(f"total skipped: {i}")
        count += 1
        if count % 500 == 0:
            # the original f-string was never printed; print it to report progress
            print(f"completed {count} text files as of {datetime.datetime.now()}")

        

master_text = pd.DataFrame(master_text)
master_text

%cd "C:\Users\jwr17\Desktop\CORD-19-research-challenge\my_code_dfs"
master_text.to_csv("master_text.csv")
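
The reading loop only keeps a running count of skipped files. A minimal sketch of recording which files were skipped and why is shown below, under the assumption that the usual failure modes are malformed JSON or a missing key. The helper name `load_rows` and the two demo files are hypothetical, and only two fields are extracted to keep the example short.

```python
import json
import os
import tempfile


def load_rows(folder):
    """Parse every file in folder; return (rows, skipped)."""
    rows, skipped = [], []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        try:
            with open(full_path, "r", encoding="utf-8") as fh:
                data = json.load(fh)
            rows.append({"paper_id": data["paper_id"],
                         "title": data["metadata"]["title"]})
        except (json.JSONDecodeError, KeyError) as err:
            # Keep the file name and the error type instead of a bare counter
            skipped.append((name, type(err).__name__))
    return rows, skipped


# Demonstration with one valid and one malformed file (both hypothetical)
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.json"), "w", encoding="utf-8") as fh:
        json.dump({"paper_id": "p1", "metadata": {"title": "T"}}, fh)
    with open(os.path.join(tmp, "bad.json"), "w", encoding="utf-8") as fh:
        fh.write("{not valid json")
    rows, skipped = load_rows(tmp)

print(rows)
print(skipped)
```

Catching only the expected exception types means an unrelated bug, such as a missing import, surfaces as an error instead of silently skipping every file.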

EDA Analysis:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%cd "C:\Users\jwr17\Desktop\CORD-19-research-challenge\my_code_dfs"
master_test = pd.read_csv('master_text.csv')
meta = pd.read_csv("C:\\Users\\jwr17\\Desktop\\CORD-19-research-challenge\\metadata.csv")

C:\Users\jwr17\Anaconda3\envs\tensor 2\lib\importlib\_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)

C:\Users\jwr17\Desktop\CORD-19-research-challenge\my_code_dfs

### papers in the metadata that are not in JSON format (sha not among the parsed paper ids)
meta[~meta.sha.isin(master_test.paper_id)]

0                                             NaN
1                                             NaN
2                                             NaN
3        aecbc613ebdab36753235197ffb4f35734b5ca63
4                                             NaN
                           ...                   
44215    d4f00f66c732c292fcfc28b19f44daa2fa620901
44216    ec575d33c0d3b34af7644fcfed64af045a75ab63
44217    7f8715a818bfd325bf4413d3c07003d7ce7b6f7e
44218    07e78e218a159c35e9599e3751a99551a271597b
44219    04bc03c90437934a75fc6fdc228817234ef84c3a
Name: sha, Length: 44220, dtype: object

	paper_id	papepr_title	abstract	main_text	file_path
0	0015023cc06b5362d332b3baf348d11567ca2fbb	The RNA pseudoknots in foot-and-mouth disease ...	word count: 194 22 Text word count: 5168 23 2...	VP3, and VP0 (which is further processed to V...	biorxiv_medrxi
1	004f0f8bb66cf446678dc13cf2701feec4f36d76	Healthcare-resource-adjusted vulnerabilities t...		The 2019-nCoV epidemic has spread across Chin...	biorxiv_medrxi
2	00d16927588fb04d4be0e6b269fc02f0d3c2aa7b	Real-time, MinION-based, amplicon sequencing f...	Infectious bronchitis (IB) causes significant...	Infectious bronchitis (IB), which is caused b...	biorxiv_medrxi
3	0139ea4ca580af99b602c6435368e7fdbefacb03	A Combined Evidence Approach to Prioritize Nip...	Nipah Virus (NiV) came into limelight recentl...	Nipah is an infectious negative-sense single-...	biorxiv_medrxi
4	013d9d1cba8a54d5d3718c229b812d7cf91b6c89	Assessing spread risk of Wuhan novel coronavir...	Background: A novel coronavirus (2019-nCoV) e...	In December 2019, a cluster of patients with ...	biorxiv_medrxi
...	...	...	...	...	...
29310	ff5a79ed22ea416e6d89caad1cf0d83dbc741a4b	Understanding Human Coronavirus HCoV-NL63	Even though coronavirus infection of humans i...	Regardless of geographic location, respirator...	noncomm_use_subse
29311	ff6d57f2aad99be129432058665b361dc18747e8	Brief Definitive Report MACROPHAGES GENETICALL...	There is extensive evidence that cultured mac...	Experiments were designed to test whether sub...	noncomm_use_subse
29312	ff83907653a4c4500e8c509ca28169e924742b40	Identification of a Subdomain of CENPB That Is...	We have combined in vivo and in vitro approac...	can function in an autonomous fashion, reloca...	noncomm_use_subse
29313	ffe718db1820f27bf274e3fc519ab78e450de288	Replication enhancer elements within the open ...	We provide experimental evidence of a replica...	Tick-borne encephalitis virus (TBEV) is a hum...	noncomm_use_subse
29314	ffed5d2a31a0c1a0db11905fe378e7735b6d70ca	Supplemental material for the paper "Evidence ...	Israel. *Corresponding author (TT): tamirtul@...	20min). We trimmed the poly-A adaptors from t...	noncomm_use_subse

	sha	source_x	title	doi	pmcid	pubmed_id	license	abstract	publish_time	authors	journal	Microsoft Academic Paper ID	WHO #Covidence	has_full_text	full_text_file
0	NaN	Elsevier	Intrauterine virus infections and congenital h...	10.1016/0002-8703(72)90077-4	NaN	4361535.0	els-covid	Abstract The etiologic basis for the vast majo...	1972-12-31	Overall, James C.	American Heart Journal	NaN	NaN	False	custom_license
1	NaN	Elsevier	Coronaviruses in Balkan nephritis	10.1016/0002-8703(80)90355-5	NaN	6243850.0	els-covid	NaN	1980-03-31	Georgescu, Leonida; Diosi, Peter; Buţiu, Ioan;...	American Heart Journal	NaN	NaN	False	custom_license
2	NaN	Elsevier	Cigarette smoking and coronary heart disease: ...	10.1016/0002-8703(80)90356-7	NaN	7355701.0	els-covid	NaN	1980-03-31	Friedman, Gary D	American Heart Journal	NaN	NaN	False	custom_license
4	NaN	Elsevier	Epidemiology of community-acquired respiratory...	10.1016/0002-9343(85)90361-4	NaN	4014285.0	els-covid	Abstract Upper respiratory tract infections ar...	1985-06-28	Garibaldi, Richard A.	The American Journal of Medicine	NaN	NaN	False	custom_license
13	NaN	Elsevier	Monoclonal antibodies identify multiple epitop...	10.1016/0006-291X(85)91946-1	NaN	2409966.0	els-covid	Abstract Nine hybridoma cell lines secreting a...	1985-06-28	Cherel, Isabelle; Grosclaude, Jeanne; Rouze, P...	Biochemical and Biophysical Research Communica...	NaN	NaN	False	custom_license
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
44207	e44632c9b598cac15ccda521e13c65ca9fcf7426; 6695...	PMC	The Healthy Infant Nasal Transcriptome: A Benc...	10.1038/srep33994	PMC5034274	27658638.0	cc-by	Responses by resident cells are likely to play...	2016 Sep 23	Chu, Chin-Yi; Qiu, Xing; Wang, Lu; Bhattachary...	Sci Rep	NaN	NaN	True	comm_use_subset
44210	a396657b0c580d496e109b82967c1a89d191ee9b; 560c...	PMC	The intrinsic vulnerability of networks to epi...	10.1016/j.ecolmodel.2018.05.013	PMC6039859	30210182.0	cc-by-nc-nd	Contact networks are convenient models to inve...	2018 Sep 10	Strona, G.; Carstens, C.J.; Beck, P.S.A.; Han,...	Ecol Modell	NaN	NaN	True	noncomm_use_subset
44211	7d77a852039f1cfc2c13843ecfa721c1fe49528c; fc02...	PMC	Lung ultrasound as a diagnostic tool for radio...	10.1016/j.rmed.2017.05.007	PMC5480773	28610670.0	cc-by	BACKGROUND: Pneumonia is a leading cause of mo...	2017 Jul	Ellington, Laura E.; Gilman, Robert H.; Chavez...	Respir Med	NaN	NaN	True	comm_use_subset
44212	NaN	Elsevier	Calculating virus spread	10.1016/S0262-4079(20)30402-4	NaN	NaN	els-covid	Getting a full picture of the coronavirus outb...	2020-02-22	Kucharski, Adam	New Scientist	3.006474e+09	#1600	False	custom_license
44213	428d1091cf63872ea81cb3c1632d76c4813748a1; 8244...	PMC	Viral etiology of hospitalized acute lower res...	10.3325/cmj.2013.54.122	PMC3641872	23630140.0	cc-by	AIM: To estimate the proportional contribution...	2013 Apr	Lukšić, Ivana; Kearns, Patrick K; Scott, Fiona...	Croat Med J	NaN	NaN	True	comm_use_subset
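
Some `sha` cells in the output above hold several hashes separated by `; ` (row 44207, for example), so an exact `isin` comparison treats those rows as unmatched even when one of their hashes was parsed. A minimal sketch of matching on any hash in the cell follows; the helper names `split_shas` and `has_json` and the sample values are hypothetical.

```python
def split_shas(value):
    """Split a metadata 'sha' cell into individual hashes; NaN/None gives []."""
    if not isinstance(value, str):
        return []
    return [part.strip() for part in value.split(";") if part.strip()]


def has_json(value, known_ids):
    """True if any hash in the cell matches a parsed paper_id."""
    return any(sha in known_ids for sha in split_shas(value))


# Hypothetical values shaped like the 'sha' column above
known_ids = {"aaa111", "ccc333"}
cells = [float("nan"), "aaa111", "bbb222; ccc333", "ddd444"]
flags = [has_json(cell, known_ids) for cell in cells]
print(flags)
```

With the dataframes from this notebook, the same idea could be applied as `ids = set(master_test.paper_id)` followed by `meta[~meta.sha.apply(lambda s: has_json(s, ids))]`.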