Present Day (TL;DR)
Before you peek at the answer without understanding how it came about, I urge you to read the backstory and then come back to this section. By the way, what I show below is only about 30% of the answer. You can fill in the rest to suit your use case.
You make a skeleton like so:
import pandas as pd

file_name = "folder/largefile.dat"
nrows, frames = 0, []
for chunk in pd.read_csv(file_name, sep="|", chunksize=2500000, usecols=["Col1", "Col2"]):
    try:
        # Keep only the rows whose Col1 value is between 0 and 9 inclusive
        frames.append(chunk[chunk.Col1.isin(range(10))])
        nrows += chunk.shape[0]
        print("%d rows processed." % nrows)
    except Exception as e:
        print(e)
The place of interest is the line inside the try block that filters the chunk. I am merely building a list of data frames, chunk by chunk, keeping the rows where Col1 has a value between 0 and 9 inclusive. But you can do a lot of things there.
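If all you want at the end is a single combined data frame (my assumption about the end goal here), one pd.concat call finishes the job:

result = pd.concat(frames, ignore_index=True)
print("%d matching rows extracted." % result.shape[0])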
Here’s a hint: create a function that takes inputs such as the file name, the columns of interest, and, most importantly, the function you want to apply to each chunk. Maybe you do not want to extract records. Maybe you want to build count tables, apply transformations, or do about a million other things. The good news is that the only limit is your creativity.
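As a rough sketch of that idea (the function name, parameters, and defaults below are my own, not something from the original code):

import pandas as pd

def process_in_chunks(file_name, usecols, chunk_func, sep="|", chunksize=2500000):
    # Read the file in chunks and apply chunk_func to each one,
    # collecting whatever it returns (frames, counts, anything).
    results = []
    for chunk in pd.read_csv(file_name, sep=sep, chunksize=chunksize, usecols=usecols):
        results.append(chunk_func(chunk))
    return results

# The same Col1 filter as above, expressed as a chunk function.
frames = process_in_chunks("folder/largefile.dat", ["Col1", "Col2"],
                           lambda chunk: chunk[chunk.Col1.isin(range(10))])

The per-chunk function could just as easily return something like chunk.Col2.value_counts() to build count tables instead of extracting records.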
One issue I faced is that when an “EOF character found” error occurs inside a chunk, there is no way to skip the offending record with skiprows. The iterator simply stops, and no exception is thrown.
My immediate idea for next steps is
to try parallelizing this relatively snail-paced version that serially
processes the file chunk by chunk.
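To make that idea a little more concrete, here is one hypothetical shape it could take (my own untested sketch, not something I have benchmarked): keep reading the chunks serially, but hand the per-chunk work to a pool of worker processes.

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def keep_small_col1(chunk):
    # Same filter as before; this part runs in a worker process.
    return chunk[chunk.Col1.isin(range(10))]

def parallel_filter(file_name, n_workers=4):
    # The reading itself is still serial; only the per-chunk work is parallel.
    reader = pd.read_csv(file_name, sep="|", chunksize=2500000,
                         usecols=["Col1", "Col2"])
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        frames = list(pool.map(keep_small_col1, reader))
    return pd.concat(frames, ignore_index=True)

This would only pay off when the per-chunk work dominates the I/O, since each chunk still has to be read (and pickled over to a worker) one at a time.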
Early in November 2016 (The month Trump became US President)
The day started like any other day. I came to the office,
fired up my machine, and went to fetch my customary cup of java.
This was the day when everything changed. My faithful
Jupyter Notebook environment was inundated with super-massive delimiter-separated
files upward of 10 GB in size. Suddenly, 30 GB of RAM was not enough. The
application stuttered. MemoryError
exceptions were everywhere.
It was a welcome challenge to work with such large files without immediately resorting to Spark or the like.
I wrote a simple first version that used the nrows and skiprows arguments to iterate through the file manually. It relied on a loop counter i multiplied by a chosen chunk size: on each pass I would set nrows = <chunk size> and skiprows = i * <chunk size>. One exercise for the reader is to write this “manual” version.
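If you want to compare notes after trying the exercise, here is one hypothetical shape the manual version could take (the header-preserving skiprows=range(...) trick is my own choice, not necessarily what I originally wrote):

import pandas as pd

def manual_version(file_name, chunk_size=2500000):
    # skiprows=range(1, ...) keeps the header row while skipping
    # the data rows of every chunk that has already been read.
    i, frames = 0, []
    while True:
        chunk = pd.read_csv(file_name, sep="|", usecols=["Col1", "Col2"],
                            skiprows=range(1, i * chunk_size + 1),
                            nrows=chunk_size)
        if chunk.empty:
            break
        frames.append(chunk[chunk.Col1.isin(range(10))])
        i += 1
    return pd.concat(frames, ignore_index=True)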
There was nothing wrong with how that approach worked. But why not use something built-in and Pythonic? For now, the chunksize approach shown at the top works well, and we use it for anything we do with large files.
In the next part of this post, we will discuss performance comparisons. We will also begin considering more efficient ways of working with very large files, including parallelism.
Watch this space for other posts and articles! If you have any questions, you can reach out to our team here.