Master Text Dataframe Creator:

Creates a dataframe with four text columns, plus a file_path column recording each paper's source folder:

  • Column One: The paper id
  • Column Two: The title of the paper
  • Column Three: The abstract of the paper
  • Column Four: The main text of the paper
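
To make the column mapping concrete, here is a minimal sketch of how one paper's JSON could be flattened into a single row. It assumes the paper_id / metadata.title / abstract / body_text layout that the loading loop in this notebook reads; the sample document and the helper name `paper_to_row` are hypothetical and only for illustration.

```python
import json

# Hypothetical, heavily shortened stand-in for one CORD-19 JSON file.
sample_json = """
{
  "paper_id": "0000example",
  "metadata": {"title": "An example title"},
  "abstract": [{"text": "First abstract paragraph."},
               {"text": "Second abstract paragraph."}],
  "body_text": [{"text": "Introduction text."},
                {"text": "Methods text."}]
}
"""

def paper_to_row(data):
    """Flatten one parsed paper (a dict) into the four text columns."""
    return {
        "paper_id": data["paper_id"],
        "title": data["metadata"]["title"],
        # abstract and body_text are lists of paragraphs, each with a "text" field
        "abstract": " ".join(part["text"] for part in data["abstract"]),
        "main_text": " ".join(part["text"] for part in data["body_text"]),
    }

row = paper_to_row(json.loads(sample_json))
for column, value in row.items():
    print(f"{column}: {value}")
```

A list of such dicts can be passed straight to pd.DataFrame, which is how the notebook builds its master table.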

In [121]:
#### dependencies ####
import pandas as pd
import numpy as np
import os
import json
import datetime

#### Navigate to your CORD-19 download folder #####
%cd "C:\Users\jwr17\Desktop\CORD-19-research-challenge"
C:\Users\jwr17\Desktop\CORD-19-research-challenge
In [148]:
#### understanding the file system ######
file_paths = ["biorxiv_medrxiv\\biorxiv_medrxiv", "comm_use_subset\\comm_use_subset", "custom_license\\custom_license", "noncomm_use_subset\\noncomm_use_subset"]
total_files = 0
for path in file_paths:
    n_files = len(os.listdir(path))  ### list each folder only once
    print(f"file folder {path} has a total of {n_files} files")
    total_files += n_files
print(f"\n total files are equal to {total_files}")
file folder biorxiv_medrxiv\biorxiv_medrxiv has a total of 885 files
file folder comm_use_subset\comm_use_subset has a total of 9118 files
file folder custom_license\custom_license has a total of 16959 files
file folder noncomm_use_subset\noncomm_use_subset has a total of 2353 files

 total files are equal to 29315
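
The count above relies on Windows-style backslash paths and counts every directory entry, whatever its extension. A possible portable variant, sketched under the assumption that each subset is nested as <root>/<name>/<name> (as in the paths above) and that only *.json files are papers, could use pathlib. The function name `count_json_files` and the throwaway demo folder are hypothetical.

```python
import tempfile
from pathlib import Path

def count_json_files(root, subsets):
    """Count *.json files in <root>/<name>/<name> for each subset name."""
    counts = {}
    for name in subsets:
        folder = Path(root) / name / name  # assumed double nesting of each subset
        counts[name] = sum(1 for _ in folder.glob("*.json"))
    return counts

# Demo on a temporary directory so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as root:
    demo = Path(root) / "demo_subset" / "demo_subset"
    demo.mkdir(parents=True)
    for n in range(3):
        (demo / f"paper_{n}.json").write_text("{}")
    (demo / "notes.txt").write_text("not a paper")  # ignored by the *.json filter
    counts = count_json_files(root, ["demo_subset"])

print(counts)
print("total:", sum(counts.values()))
```

On the real download, the same function would be called with the CORD-19 root folder and the four subset names.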

The four source folders are biorxiv_medrxiv, commercial use, custom license, and non-commercial use.

In [153]:
#### reading in the files loop ####
master_text = []
skipped = 0
count = 0
file_paths = ["biorxiv_medrxiv\\biorxiv_medrxiv", "comm_use_subset\\comm_use_subset", "custom_license\\custom_license", "noncomm_use_subset\\noncomm_use_subset"]

for path in file_paths:
    folder = path.split("\\")[0]  ### top-level folder name, e.g. "biorxiv_medrxiv"
    for file in os.listdir(path):  ### cycles through every json file from a given file path
        try:
            with open(os.path.join(path, file), "r", encoding="utf-8") as read_file:  ### reading in json file
                data = json.load(read_file)

            paper_id = data['paper_id']
            paper_title = data['metadata']['title']

            ##### getting abstract: join the paragraph texts #######
            abs_text = ' '.join(section['text'] for section in data['abstract'])

            ####  getting body text: join the paragraph texts ####
            main_text = ' '.join(section['text'] for section in data['body_text'])

            ### adding to list; the 'papepr_title' key matches the column name shown in the output ###
            master_text.append({'paper_id': paper_id, 'papepr_title': paper_title, 'abstract': abs_text, 'main_text': main_text, "file_path": folder})
        except (OSError, KeyError, ValueError):  ### unreadable file, missing field, or invalid json
            skipped += 1
            print(f"total skipped: {skipped}")
        count += 1
        if count % 500 == 0:
            print(f"completed {count} text files as of {datetime.datetime.now()}")

master_text = pd.DataFrame(master_text)
master_text
Out[153]:
paper_id papepr_title abstract main_text file_path
0 0015023cc06b5362d332b3baf348d11567ca2fbb The RNA pseudoknots in foot-and-mouth disease ... word count: 194 22 Text word count: 5168 23 2... VP3, and VP0 (which is further processed to V... biorxiv_medrxiv
1 004f0f8bb66cf446678dc13cf2701feec4f36d76 Healthcare-resource-adjusted vulnerabilities t... The 2019-nCoV epidemic has spread across Chin... biorxiv_medrxiv
2 00d16927588fb04d4be0e6b269fc02f0d3c2aa7b Real-time, MinION-based, amplicon sequencing f... Infectious bronchitis (IB) causes significant... Infectious bronchitis (IB), which is caused b... biorxiv_medrxiv
3 0139ea4ca580af99b602c6435368e7fdbefacb03 A Combined Evidence Approach to Prioritize Nip... Nipah Virus (NiV) came into limelight recentl... Nipah is an infectious negative-sense single-... biorxiv_medrxiv
4 013d9d1cba8a54d5d3718c229b812d7cf91b6c89 Assessing spread risk of Wuhan novel coronavir... Background: A novel coronavirus (2019-nCoV) e... In December 2019, a cluster of patients with ... biorxiv_medrxiv
... ... ... ... ... ...
29310 ff5a79ed22ea416e6d89caad1cf0d83dbc741a4b Understanding Human Coronavirus HCoV-NL63 Even though coronavirus infection of humans i... Regardless of geographic location, respirator... noncomm_use_subset
29311 ff6d57f2aad99be129432058665b361dc18747e8 Brief Definitive Report MACROPHAGES GENETICALL... There is extensive evidence that cultured mac... Experiments were designed to test whether sub... noncomm_use_subset
29312 ff83907653a4c4500e8c509ca28169e924742b40 Identification of a Subdomain of CENPB That Is... We have combined in vivo and in vitro approac... can function in an autonomous fashion, reloca... noncomm_use_subset
29313 ffe718db1820f27bf274e3fc519ab78e450de288 Replication enhancer elements within the open ... We provide experimental evidence of a replica... Tick-borne encephalitis virus (TBEV) is a hum... noncomm_use_subset
29314 ffed5d2a31a0c1a0db11905fe378e7735b6d70ca Supplemental material for the paper "Evidence ... Israel. *Corresponding author (TT): tamirtul@... 20min). We trimmed the poly-A adaptors from t... noncomm_use_subset

29315 rows × 5 columns

In [ ]:
%cd "C:\Users\jwr17\Desktop\CORD-19-research-challenge\my_code_dfs"
master_text.to_csv("master_text.csv", index=False)  ### index=False avoids an extra unnamed index column when the csv is read back

Exploratory Data Analysis (EDA):

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%cd "C:\Users\jwr17\Desktop\CORD-19-research-challenge\my_code_dfs"
master_test = pd.read_csv('master_text.csv')
meta = pd.read_csv("C:\\Users\\jwr17\\Desktop\\CORD-19-research-challenge\\metadata.csv")
C:\Users\jwr17\Anaconda3\envs\tensor 2\lib\importlib\_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)
C:\Users\jwr17\Desktop\CORD-19-research-challenge\my_code_dfs
In [13]:
### papers in the metadata that have no parsed json file
meta[~meta.sha.isin(master_test.paper_id)]
Out[13]:
sha source_x title doi pmcid pubmed_id license abstract publish_time authors journal Microsoft Academic Paper ID WHO #Covidence has_full_text full_text_file
0 NaN Elsevier Intrauterine virus infections and congenital h... 10.1016/0002-8703(72)90077-4 NaN 4361535.0 els-covid Abstract The etiologic basis for the vast majo... 1972-12-31 Overall, James C. American Heart Journal NaN NaN False custom_license
1 NaN Elsevier Coronaviruses in Balkan nephritis 10.1016/0002-8703(80)90355-5 NaN 6243850.0 els-covid NaN 1980-03-31 Georgescu, Leonida; Diosi, Peter; Buţiu, Ioan;... American Heart Journal NaN NaN False custom_license
2 NaN Elsevier Cigarette smoking and coronary heart disease: ... 10.1016/0002-8703(80)90356-7 NaN 7355701.0 els-covid NaN 1980-03-31 Friedman, Gary D American Heart Journal NaN NaN False custom_license
4 NaN Elsevier Epidemiology of community-acquired respiratory... 10.1016/0002-9343(85)90361-4 NaN 4014285.0 els-covid Abstract Upper respiratory tract infections ar... 1985-06-28 Garibaldi, Richard A. The American Journal of Medicine NaN NaN False custom_license
13 NaN Elsevier Monoclonal antibodies identify multiple epitop... 10.1016/0006-291X(85)91946-1 NaN 2409966.0 els-covid Abstract Nine hybridoma cell lines secreting a... 1985-06-28 Cherel, Isabelle; Grosclaude, Jeanne; Rouze, P... Biochemical and Biophysical Research Communica... NaN NaN False custom_license
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
44207 e44632c9b598cac15ccda521e13c65ca9fcf7426; 6695... PMC The Healthy Infant Nasal Transcriptome: A Benc... 10.1038/srep33994 PMC5034274 27658638.0 cc-by Responses by resident cells are likely to play... 2016 Sep 23 Chu, Chin-Yi; Qiu, Xing; Wang, Lu; Bhattachary... Sci Rep NaN NaN True comm_use_subset
44210 a396657b0c580d496e109b82967c1a89d191ee9b; 560c... PMC The intrinsic vulnerability of networks to epi... 10.1016/j.ecolmodel.2018.05.013 PMC6039859 30210182.0 cc-by-nc-nd Contact networks are convenient models to inve... 2018 Sep 10 Strona, G.; Carstens, C.J.; Beck, P.S.A.; Han,... Ecol Modell NaN NaN True noncomm_use_subset
44211 7d77a852039f1cfc2c13843ecfa721c1fe49528c; fc02... PMC Lung ultrasound as a diagnostic tool for radio... 10.1016/j.rmed.2017.05.007 PMC5480773 28610670.0 cc-by BACKGROUND: Pneumonia is a leading cause of mo... 2017 Jul Ellington, Laura E.; Gilman, Robert H.; Chavez... Respir Med NaN NaN True comm_use_subset
44212 NaN Elsevier Calculating virus spread 10.1016/S0262-4079(20)30402-4 NaN NaN els-covid Getting a full picture of the coronavirus outb... 2020-02-22 Kucharski, Adam New Scientist 3.006474e+09 #1600 False custom_license
44213 428d1091cf63872ea81cb3c1632d76c4813748a1; 8244... PMC Viral etiology of hospitalized acute lower res... 10.3325/cmj.2013.54.122 PMC3641872 23630140.0 cc-by AIM: To estimate the proportional contribution... 2013 Apr Lukšić, Ivana; Kearns, Patrick K; Scott, Fiona... Croat Med J NaN NaN True comm_use_subset

16530 rows × 15 columns
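
Several rows in this result have has_full_text set to True, and their sha value appears to hold more than one hash separated by "; ". An exact isin match on the whole string would miss those papers even when one of the hashes was parsed. The sketch below shows one way to test each hash separately. The helper names `split_shas` and `has_parsed_json` are hypothetical, the hashes are made up, and the separator is an assumption read off the truncated output above.

```python
def split_shas(sha_field):
    """Return the individual hashes stored in one metadata `sha` cell."""
    # Missing values arrive from pandas as float NaN rather than as strings.
    if not isinstance(sha_field, str):
        return []
    return [part.strip() for part in sha_field.split(";") if part.strip()]

def has_parsed_json(sha_field, paper_ids):
    """True if any hash in the cell matches a parsed paper_id."""
    return any(sha in paper_ids for sha in split_shas(sha_field))

# Toy stand-ins for master_test.paper_id and meta.sha (made-up hashes).
paper_ids = {"aaa111", "ccc333"}
sha_column = ["aaa111", "bbb222; ccc333", float("nan"), "ddd444"]

matched = [has_parsed_json(sha, paper_ids) for sha in sha_column]
print(matched)
```

On the real dataframes this could be applied as meta[~meta.sha.apply(lambda s: has_parsed_json(s, ids))], where ids = set(master_test.paper_id). Whether the unmatched count changes depends on how many multi-hash rows the metadata actually contains.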

In [7]:
meta.sha
Out[7]:
0                                             NaN
1                                             NaN
2                                             NaN
3        aecbc613ebdab36753235197ffb4f35734b5ca63
4                                             NaN
                           ...                   
44215    d4f00f66c732c292fcfc28b19f44daa2fa620901
44216    ec575d33c0d3b34af7644fcfed64af045a75ab63
44217    7f8715a818bfd325bf4413d3c07003d7ce7b6f7e
44218    07e78e218a159c35e9599e3751a99551a271597b
44219    04bc03c90437934a75fc6fdc228817234ef84c3a
Name: sha, Length: 44220, dtype: object