November 25, 2016

Wrangling Large Data in Python (Part 1)

Present Day (TL;DR)

Before you peek at the answer without understanding how it came about, I urge you to read the backstory and then come back to this section. By the way, what I show you below is only about 30% of the answer. You can fill in the rest as per your use case.

You make a skeleton like so:

import pandas as pd

file_name = "folder/largefile.dat"
nrows, frames = 0, []
for chunk in pd.read_csv(file_name, sep="|", chunksize=2500000,
                         usecols=["Col1", "Col2"]):
    # Keep only the records where Col1 is between 0 and 9 inclusive.
    frames.append(chunk[chunk["Col1"].between(0, 9)])
    nrows += chunk.shape[0]
    print("%d rows processed." % nrows)

The place of interest is the line that filters each chunk.
I am merely building a list of data frames, chunk by chunk, based on the condition that Col1 has values between 0 and 9 inclusive. But you can do a lot of things there.
Here’s a hint: Create a function that takes inputs such as the file name, the columns of interest, and, most importantly, the function you want to apply to each chunk. Maybe you do not want to extract records. Maybe you want to build count tables, apply transformations, or do about a million other things. The good news is that the only limit to what you can do is your creativity.
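A minimal sketch of that hint follows. The helper name `process_large_file` and its signature are my own invention; a small in-memory buffer stands in for the multi-gigabyte file:

```python
import io
import pandas as pd

def process_large_file(source, usecols, per_chunk, sep="|", chunksize=2500000):
    # Hypothetical helper: apply per_chunk to every chunk and collect the results.
    results = []
    for chunk in pd.read_csv(source, sep=sep, chunksize=chunksize, usecols=usecols):
        results.append(per_chunk(chunk))
    return results

# Tiny in-memory example standing in for a multi-GB pipe-delimited file.
data = io.StringIO("Col1|Col2|Col3\n1|a|9\n12|b|8\n3|c|7\n")
parts = process_large_file(data, ["Col1", "Col2"],
                           lambda c: c[c["Col1"].between(0, 9)],
                           chunksize=2)
extracted = pd.concat(parts, ignore_index=True)
```

Swapping the lambda for a different per-chunk function gives you count tables, transformations, and so on, without touching the reading loop.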
One issue I faced is that when a record triggers an “EOF character found” error, the skiprows argument does not skip that record. The iterator simply stops, and no exception is thrown.
My immediate idea for next steps is to try parallelizing this relatively snail-paced version that serially processes the file chunk by chunk.
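One possible direction for that, sketched under my own assumptions (the reading itself stays serial because the iterator yields chunks one at a time; a thread pool only overlaps the per-chunk work, and true CPU parallelism would need processes since pandas is largely GIL-bound):

```python
import io
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def per_chunk(chunk):
    # Hypothetical per-chunk work: the same Col1 filter used earlier.
    return chunk[chunk["Col1"].between(0, 9)]

# Small in-memory stand-in for the large pipe-delimited file.
data = io.StringIO("Col1|Col2\n1|a\n12|b\n3|c\n15|d\n4|e\n")
reader = pd.read_csv(data, sep="|", chunksize=2)

# Hand each chunk off to a pool so its processing can overlap with other chunks.
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(per_chunk, reader))

result = pd.concat(parts, ignore_index=True)
```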

Early in November 2016 (The month Trump was elected US President)

The day started like any other day. I came to the office, fired up my machine, and went to fetch my customary cup of java.

This was the day when everything changed. My faithful Jupyter Notebook environment was inundated with super-massive delimiter-separated files upward of 10 GB in size. Suddenly, 30 GB of RAM was not enough. The application stuttered. MemoryError exceptions were everywhere.
It was a welcome challenge to work with such large files without immediately resorting to Spark or such.

I wrote a simple first version that used the nrows and skiprows arguments to manually iterate through chunks of the file. This version used a loop counter variable that I multiplied by a chosen chunk size, then manually set nrows = <chunk size> and skiprows = i * <chunk size>. One exercise for the reader is to write this “manual” version.
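If you want to check your answer to the exercise, here is one possible sketch (a tiny in-memory buffer stands in for the real file, and skiprows is passed a range so the header row survives each call):

```python
import io
import pandas as pd

# Small stand-in for the large pipe-delimited file: header plus 10 data rows.
data = "Col1|Col2\n" + "\n".join("%d|x%d" % (i, i) for i in range(10))

chunk_size = 3
i, frames = 0, []
while True:
    # Re-read from the top each time, skipping the data rows already consumed
    # (row 0 is the header, so skipping starts at row 1).
    chunk = pd.read_csv(io.StringIO(data), sep="|", nrows=chunk_size,
                        skiprows=range(1, i * chunk_size + 1))
    if chunk.empty:
        break
    frames.append(chunk)
    i += 1
    if chunk.shape[0] < chunk_size:
        break  # short chunk means we hit the end of the file

result = pd.concat(frames, ignore_index=True)
```

Note that each iteration scans the file from the beginning again, which is part of why this manual version loses out to the built-in chunksize approach.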

There was nothing wrong with how that approach worked. But why not use something built-in and Pythonic? For now, the first approach, which uses the chunksize argument, works well, and we use it for everything we do with large files.

In the next part of this post, we will compare the performance of the two approaches. We will also begin considering more optimal ways of working with very large files, including parallelism.

Watch this space for other posts and articles! If you have any questions, you can reach out to our team here.
