DiDacTexGit/Talk-ProcessingLargeDatawithPandas

Motivation

Data sets can get large quickly.
You can quickly go from looking at:

  • a few hundred rows and a handful of columns to...
  • a million rows with hundreds of columns.

Python Pandas (with smart use of categories) can reduce the size of one's data in memory by up to 90%.

This repository contains a tutorial and supporting scripts to showcase the power of Python Pandas with categories.

The tutorial is located in the file:

  • slides.md

Web Tutorial

The tutorial is hosted here

Outline

In this tutorial, we will:

  • Learn how Python uses memory with Pandas
  • Learn how to reduce a Pandas DataFrame's memory footprint
  • Learn what data types are
  • Speed up reading in CSV files by using categories
  • Reduce the memory footprint by up to 90%
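As a rough illustration of reading a CSV file with categories, the sketch below asks `read_csv` to parse a repetitive text column as a category from the start. The column names `day` and `sales` and the in-memory CSV text are made up for this example; they are not taken from the tutorial's scripts.

```python
import io

import pandas as pd

# Hypothetical CSV contents, built in memory so the sketch is self-contained.
rows = [f"{day},{i}" for i, day in enumerate(["Sunday", "Monday", "Tuesday"] * 100)]
csv_text = "day,sales\n" + "\n".join(rows)

# The dtype mapping tells read_csv to build the "day" column as a category
# while parsing, instead of converting it afterwards.
df = pd.read_csv(io.StringIO(csv_text), dtype={"day": "category"})

print(df.dtypes)
```

For a real file, the same `dtype` argument can be passed alongside a file path instead of the `io.StringIO` object.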

In a nutshell

Instead of storing "Sunday", "Sunday", "Sunday", ... Pandas with categories stores the label once, says

"Sunday = 1", and keeps the column as [1, 1, 1].

An 8-bit integer such as 1 takes far less memory than the string "Sunday".
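A minimal sketch of this mapping, using a small made-up series of day names: the `.cat` accessor exposes both the distinct labels and the integer code stored for each row. For a small number of categories, Pandas stores the codes as signed 8-bit integers (`int8`), and the codes start at 0 rather than 1.

```python
import pandas as pd

# A tiny example column; the values are illustrative only.
days = pd.Series(["Sunday", "Monday", "Sunday", "Sunday"], dtype="category")

# The distinct labels are stored once (sorted when inferred from the data)...
print(days.cat.categories.tolist())  # ['Monday', 'Sunday']

# ...and each row holds only a small integer code pointing at its label.
print(days.cat.codes.tolist())       # [1, 0, 1, 1]
print(days.cat.codes.dtype)          # int8
```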

To convert a column to a category, change its dtype and assign the result back to the column:

df['column name'] = df['column name'].astype('category')
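The effect on memory can be checked with `memory_usage(deep=True)`. The sketch below uses a made-up column named `day` with three repeated values; the actual saving depends on how repetitive the real data is and on the Pandas version.

```python
import pandas as pd

# Hypothetical data: 300,000 rows drawn from only three distinct strings.
df = pd.DataFrame({"day": ["Sunday", "Monday", "Tuesday"] * 100_000})

# deep=True counts the bytes of the strings themselves, not just the pointers.
before = df["day"].memory_usage(deep=True)

# astype returns a new Series, so assign it back to replace the column.
df["day"] = df["day"].astype("category")
after = df["day"].memory_usage(deep=True)

saving = 1 - after / before
print(f"before: {before:,} bytes, after: {after:,} bytes, saving: {saving:.0%}")
```

Columns with many distinct values (for example, unique IDs) benefit far less, since every label still has to be stored once.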

Slide Deck

The tutorial can be run locally as an HTML slide deck. To view it, first start a web server in the repository's main directory.

This can be done with Python 3:

python -m http.server

Then open a browser and go to http://localhost:8000

Reference