Data sets can get large quickly.
You can easily go from looking at:
- a few hundred lines and a handful of columns to...
- a million lines with hundreds of columns.
Python Pandas (with smart use of Categories) can reduce the size of your data in memory by up to 90%.
This repository contains a tutorial and supporting scripts that showcase the power of Python Pandas with categories.
The tutorial is located in the file:
- slides.md
The tutorial is hosted here
In this tutorial, we will:
- Learn how Python uses memory with Pandas
- Learn how to reduce a Pandas DataFrame's memory footprint
- Learn what data types are
- Speed up reading in CSV files by using categories
- Reduce the memory footprint by up to 90%
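As a minimal sketch of reading a CSV straight into categories, the example below passes `dtype` to `pd.read_csv`. The CSV content and the column names `day` and `sales` are made up for illustration and are not taken from the tutorial.

```python
import io

import pandas as pd

# Hypothetical CSV data: 1,000 rows with a highly repetitive "day" column.
days = ["Sunday", "Monday", "Sunday", "Tuesday"] * 250
csv_text = "day,sales\n" + "\n".join(f"{d},{i}" for i, d in enumerate(days))

# Ask read_csv to build the categorical column while parsing,
# instead of loading strings first and converting afterwards.
df = pd.read_csv(io.StringIO(csv_text), dtype={"day": "category"})

print(df.dtypes)
print(df["day"].cat.categories.tolist())
```

Declaring the dtype up front avoids holding the full string column in memory before conversion. Whether it also speeds up parsing depends on the data.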
Instead of storing "Sunday", "Sunday", "Sunday", ... Pandas with categories records
"Sunday = 1" once and stores the column as [1, 1, 1].
A small integer such as an int8 "1" takes a lot less memory than the string "Sunday".
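To inspect this mapping yourself, the sketch below builds a small categorical Series (the values are illustrative) and prints the stored labels and the integer codes. Pandas numbers its codes from 0 in category order, so the exact integers depend on the data.

```python
import pandas as pd

# Illustrative data: a column with many repeats of a few labels.
days = pd.Series(["Sunday", "Sunday", "Sunday", "Monday"], dtype="category")

# The distinct labels are stored once...
print(days.cat.categories.tolist())

# ...and each row holds only a small integer code pointing at a label.
print(days.cat.codes.tolist())
print(days.cat.codes.dtype)
```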
To convert a column to a category, change its dtype with the following command. Note that astype returns a new column, so assign the result back.
df['column name'] = df['column name'].astype('category')
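To check how much a conversion saves, you can compare `memory_usage(deep=True)` before and after. This is a rough sketch on made-up data (the column name `day` is hypothetical). The actual reduction depends on how repetitive your column is and on your Pandas version.

```python
import pandas as pd

# Made-up data: 300,000 rows drawn from only three distinct strings.
df = pd.DataFrame({"day": ["Sunday", "Monday", "Tuesday"] * 100_000})

# deep=True counts the memory of the strings themselves, not just pointers.
before = df["day"].memory_usage(deep=True)

# Convert and assign back, since astype does not modify in place.
df["day"] = df["day"].astype("category")
after = df["day"].memory_usage(deep=True)

print(f"before:    {before:,} bytes")
print(f"after:     {after:,} bytes")
print(f"reduction: {1 - after / before:.0%}")
```

Columns with few distinct values relative to their length benefit most. A column where nearly every value is unique may not shrink at all.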
The tutorial can be run locally as an HTML slide deck. To view it, you first need to start a web server in the main directory.
This can be done with Python (version 3):
python -m http.server
Then open a browser and go to http://localhost:8000