Hi everyone! I've been looking for a while for a CSV formatter that matches my one criterion: it has to be fast, really fast.

Some context: I want to format billions of lines as fast as possible. While benchmarking different libraries, I got the best results with fast-csv, which took about 40 minutes for 200 million lines of CSV. That is way too much for me (if it were something like 10 minutes, I would have considered it). I think those libraries are quite slow because of the underlying RegEx implementations of the formatter, which scale pretty badly.

I am not really looking for a fancy formatter with tons of options. I just want to make sure the output is valid CSV, meaning other people can parse it with a CSV parser. I need a formatter because my data is pretty funky: it contains "," and pretty much any other character you can imagine.

In case this information is needed, this is the flow of my app:

1. Fetch documents from MongoDB in batches (because there are tons of them). I can't control the format of those documents and I am not the one adding the data, so any tweak to them is off the table.
2. Each document has an array field. For each document, I parse that array, and for each element I create an in-memory JSON object (or array; the format doesn't matter) which I add to a list.
3. Once I'm done parsing the array of a document, I give the list to the CSV formatter and redirect the formatter's output to a Readable stream which is part of a ManagedUpload to S3.

As a last resort, I'm considering implementing a formatter myself which simply does some ifs and puts the fields that contain "," between quotation marks.
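
For illustration, here is a rough sketch of what that last-resort formatter might look like in TypeScript. It assumes RFC 4180-style quoting, which means a comma check alone is not enough: fields containing a double quote or a line break also have to be quoted, and embedded double quotes are doubled, otherwise other people's parsers may reject the output. The names `formatField`, `formatRow` and `formatRows` are made up for this example and do not come from fast-csv or any other library. The sketch scans character codes instead of using a RegEx, but it has not been benchmarked.

```typescript
// Hypothetical helpers for illustration only; not part of any library.

// Quote a single value following RFC 4180-style rules (assumed here).
function formatField(value: unknown): string {
  // Treat null/undefined as an empty field; stringify everything else.
  const text = value === null || value === undefined ? "" : String(value);

  // Scan once for characters that force quoting:
  // 34 = '"', 44 = ',', 10 = '\n', 13 = '\r'
  let mustQuote = false;
  for (let i = 0; i < text.length; i++) {
    const c = text.charCodeAt(i);
    if (c === 34 || c === 44 || c === 10 || c === 13) {
      mustQuote = true;
      break;
    }
  }
  if (!mustQuote) {
    return text;
  }

  // Double any embedded quotes, then wrap the whole field in quotes.
  return '"' + text.split('"').join('""') + '"';
}

// Format one row (an array of values) as a CSV line ending in CRLF.
function formatRow(fields: unknown[]): string {
  return fields.map(formatField).join(",") + "\r\n";
}

// Format a batch of rows into one string chunk, e.g. one chunk per document.
function formatRows(rows: unknown[][]): string {
  let out = "";
  for (const row of rows) {
    out += formatRow(row);
  }
  return out;
}
```

For example, `formatRow(["a,b", 'say "hi"', 42])` returns the line `"a,b","say ""hi""",42` followed by CRLF. Building one string per document with `formatRows` and pushing that chunk into the Readable stream would keep the number of stream writes low, though whether that is fast enough for billions of lines is something only a benchmark on the real data can tell.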
Submitted August 16, 2020 at 10:20AM by Chase2307