Monday, 27 July 2020

Need Help Parsing 90k JSON files

I have an assignment to create a simple site that parses a huge dataset of clinical trials and returns the authors with the most papers on a specific disease using a specific drug. Below is the code that I used:

    const fs = require('fs');
    const path = require('path');
    const { promisify } = require('util');

    // Assumed definitions: the post uses these helpers without showing them.
    const fsReaddir = promisify(fs.readdir);
    const fsReadFile = promisify(fs.readFile);

    async function searchFilesInDirectoryAsync(dir, disease, drug) {
      const files = await fsReaddir(dir);
      // Build the regexes once instead of once per file.
      const regexDisease = new RegExp('\\b' + disease + '\\b', 'i');
      const regexDrug = new RegExp('\\b' + drug + '\\b', 'i');
      // Concatenate the 'text' field of every body_text entry.
      const reducer = (accumulator, currentValue) => accumulator + currentValue['text'];

      console.time('test');
      let i = 1;
      for (const file of files) {
        console.log(i + '/89133');
        i++;
        let fileContent = await fsReadFile(path.join(__dirname, dir, file));
        fileContent = JSON.parse(fileContent);
        const texts = fileContent["body_text"];
        // Start from '' so the first entry's text is included rather than
        // being coerced to "[object Object]".
        const text = texts.reduce(reducer, '');
        if (regexDisease.test(text) && regexDrug.test(text)) {
          console.log('Found');
        } else {
          console.log('Not Found');
        }
      }
      console.timeEnd('test');
    }

The problem is that the above code takes around 26 minutes for 89k files, which together amount to around 1.5 GB of JSON.

The body text in each JSON file is an array of many texts that together make up the whole body text. The only way I can think of to increase performance here is to not use the reduce method to glue all the texts of a paper together, but instead to run the regexes on each text.

One of the problems is that with the above code my PC's disk usage is pinned at 100%, so it's bound there.
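The per-text idea mentioned above could look roughly like the sketch below. The helper names (`escapeRegExp`, `buildMatchers`, `paperMentionsBoth`) are hypothetical and not part of the original code, and escaping the search terms is an added assumption about how user input should be handled. Going by the timing breakdown in the post, reduce and regex are small next to the read, so this is unlikely to change the total much.

```javascript
// Escape regex metacharacters so the search terms are matched literally.
// (Assumption: the original code passes the terms to RegExp unescaped.)
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build both word-boundary regexes once, outside the per-file loop.
function buildMatchers(disease, drug) {
  return {
    disease: new RegExp('\\b' + escapeRegExp(disease) + '\\b', 'i'),
    drug: new RegExp('\\b' + escapeRegExp(drug) + '\\b', 'i'),
  };
}

// Test each body_text entry on its own and stop as soon as both terms
// have been seen somewhere in the paper.
function paperMentionsBoth(bodyText, matchers) {
  let foundDisease = false;
  let foundDrug = false;
  for (const segment of bodyText) {
    const text = segment.text;
    if (!foundDisease && matchers.disease.test(text)) foundDisease = true;
    if (!foundDrug && matchers.drug.test(text)) foundDrug = true;
    if (foundDisease && foundDrug) return true;
  }
  return false;
}
```

One difference from concatenating first: a term that straddles two entries would no longer match, which is probably the desired behavior but is worth being aware of.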
Different ideas I had:

- Run lots of processes in parallel
- Find a way to regex for the drug and the disease at the same time, though I found that it's not possible
- Have Node.js call an executable written in another programming language that has better performance
- Maybe somehow load the whole 1.5 GB dataset into memory for quicker reads

I haven't tried any of the above ideas yet, since I thought I might get some good advice here on how to handle this problem. Any help is appreciated, and thanks in advance for any answers.

EDIT: Also, below is a breakdown of how much time each step takes:

- read: 6.039 ms
- JSONParse: 0.387 ms
- reduce: 0.027 ms
- regex: 0.158 ms
- test: 7.381 ms
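Since the read step dominates the breakdown, one cheap variant of the "in parallel" idea is to keep several reads in flight inside a single process rather than awaiting them one at a time. The following is a minimal sketch under stated assumptions: `mapWithConcurrency` and `searchFilesConcurrently` are hypothetical names, the limit of 32 is an arbitrary starting point, and the path is resolved relative to `dir` rather than `__dirname`. Whether it helps depends on the storage; with the disk already at 100%, the gain may be small, and on a spinning disk many concurrent reads could even be slower.

```javascript
const fs = require('fs');
const path = require('path');

// Run `worker` over `items` with at most `limit` calls in flight at once.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}

// Hypothetical driver: resolves to the names of the files whose body text
// mentions both the disease and the drug.
async function searchFilesConcurrently(dir, disease, drug, limit = 32) {
  const regexDisease = new RegExp('\\b' + disease + '\\b', 'i');
  const regexDrug = new RegExp('\\b' + drug + '\\b', 'i');
  const files = await fs.promises.readdir(dir);
  const matched = await mapWithConcurrency(files, limit, async (file) => {
    const raw = await fs.promises.readFile(path.join(dir, file), 'utf8');
    const paper = JSON.parse(raw);
    // Join with a space so words from adjacent entries do not run together.
    const text = paper.body_text.map((t) => t.text).join(' ');
    return regexDisease.test(text) && regexDrug.test(text) ? file : null;
  });
  return matched.filter((f) => f !== null);
}
```

A call such as `searchFilesConcurrently('./papers', 'malaria', 'chloroquine')` (directory name assumed) would resolve to an array of matching file names. Dropping the per-file `console.log` calls from the loop is also worth trying, since logging 89k lines has its own cost.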

Submitted July 27, 2020 at 10:01AM by kimonides9
