Motivation

Data sets can get large quickly.
You can quickly go from looking at:

  • a few hundred lines and a handful of columns to...
  • a million lines with hundreds of columns.

Pandas, with smart use of categories, can reduce the size of one's data in memory by up to 90%.

This repository contains a tutorial and supporting scripts to showcase the power of Pandas with categories.

The tutorial is located in the file called:

  • slides.md

Web Tutorial

The tutorial is hosted here

Outline

In this tutorial, we will:

  • Learn how Python uses memory with Pandas
  • Learn how to reduce a Pandas DataFrame's memory footprint
  • Learn what data types are
  • Speed up reading in CSV files by using categories
  • Reduce the memory footprint by up to 90%
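
As a rough sketch of the CSV step in the outline, the snippet below reads the same data twice, once with default dtypes and once with a column declared as a category at read time. The CSV content and the column names `day` and `sales` are made up for illustration and are not taken from the tutorial.

```python
import io

import pandas as pd

# Hypothetical CSV content, built in memory so the example is self-contained.
rows = ["Sunday", "Monday", "Sunday", "Tuesday"] * 250
csv_text = "day,sales\n" + "\n".join(f"{day},{i}" for i, day in enumerate(rows))

# Default read: the "day" column is stored as strings.
plain = pd.read_csv(io.StringIO(csv_text))

# Same data, but "day" is parsed straight into the category dtype.
categorized = pd.read_csv(io.StringIO(csv_text), dtype={"day": "category"})

# deep=True counts the string contents, not just the pointers to them.
plain_bytes = plain.memory_usage(deep=True).sum()
categorized_bytes = categorized.memory_usage(deep=True).sum()

print(categorized.dtypes)
print(f"plain: {plain_bytes:,} bytes, categorized: {categorized_bytes:,} bytes")
```

The exact byte counts depend on the Pandas version and on how repetitive the column is, so no specific numbers are shown here.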

In a nutshell

Instead of storing "Sunday", "Sunday", "Sunday", ... Pandas with categories says

"Sunday = 1" and stores the column as [1, 1, 1].

A one-byte integer code such as an int8 "1" takes a lot less memory than the string "Sunday".
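
The mapping described above can be inspected directly. This is a minimal sketch on a made-up four-row series; it uses the `.cat.categories` and `.cat.codes` accessors to show the labels that are stored once and the small integer stored per row. Pandas keeps the codes as signed integers (int8 when there are only a few categories).

```python
import pandas as pd

# Made-up example data, created directly as a categorical series.
days = pd.Series(["Sunday", "Sunday", "Monday", "Sunday"], dtype="category")

# The unique labels, each stored only once.
print(days.cat.categories.tolist())

# One small integer per row, pointing into the list of labels.
print(days.cat.codes.tolist())

# The integer type Pandas picked for the codes.
print(days.cat.codes.dtype)
```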

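To convert a column to the category dtype, change its dtype with the following command. Note that astype returns a new column, so the result has to be assigned back.

df['column name'] = df['column name'].astype('category')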
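
To see what the conversion buys, here is a small sketch that measures one column before and after converting it. The DataFrame and its `day` column are invented for illustration; the saving you get on real data depends on how many distinct values the column has relative to its length.

```python
import pandas as pd

# Invented data: three distinct labels repeated many times.
df = pd.DataFrame({"day": ["Sunday", "Monday", "Tuesday"] * 100_000})

# Memory used by the column while it holds strings.
before = df["day"].memory_usage(deep=True)

# Convert and assign back, since astype does not modify the column in place.
df["day"] = df["day"].astype("category")

# Memory used once the column holds integer codes plus three labels.
after = df["day"].memory_usage(deep=True)

print(f"before: {before:,} bytes, after: {after:,} bytes")
print(f"reduction: {1 - after / before:.0%}")
```

A column whose values are almost all unique (an ID column, for example) gains little or nothing from this conversion, because every label still has to be stored once.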


Slide Deck

The tutorial can be run locally as an HTML slide deck. To view it, first start a web server in the repository's main directory.

This can be done with Python (version 3):

python -m http.server

Then open a browser and go to http://localhost:8000

Reference