Oftentimes, I like to dive into open source projects to learn the best practices and design patterns that experienced programmers use to do things correctly and optimally. Peter Norvig makes the same point in his famous blog post Teach Yourself Programming in Ten Years:

Talk with other programmers; read other programs. This is more important than any book or training course.

I am a big advocate of this advice. This blog post aims to show how reading open source code helps you identify and understand efficient patterns and coding constructs.


I admire Kenneth Reitz very much. Do read and follow his The Hitchhiker’s Guide to Python! to become a great Python programmer. The lesson from that book - Reading Great Code - is the main reason I decided to have a go at reading the source code of Tablib. Reading source code is daunting at first because of constructs that are obscure or unfamiliar to you, which is natural. Despite such hurdles, if you keep concentrating you will find a lot of “Aha!” moments as you identify useful patterns. Here is my experience: I came across a very simple yet useful code snippet for a task that is important and widespread in data cleaning, i.e., removing duplicates.

Source code: tablib’s remove_duplicates method:

def remove_duplicates(self):
    """Removes all duplicate rows from the :class:`Dataset` object
    while maintaining the original order."""
    seen = set()
    self._data[:] = [row for row in self._data if not (tuple(row) in seen or seen.add(tuple(row)))]
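Outside tablib, the same pattern works on any plain list of rows. Here is a minimal, self-contained sketch (the function name dedupe_rows is my own, not part of tablib):

```python
def dedupe_rows(rows):
    """Return rows with duplicates removed, preserving first-seen order.

    Each row (a list) is converted to a tuple so it can be stored in a
    set, since lists are unhashable.
    """
    seen = set()
    return [row for row in rows
            if not (tuple(row) in seen or seen.add(tuple(row)))]


data = [[1, 2, 3], [4, 5, 6], [1, 2, 3], [4, 5, 6]]
print(dedupe_rows(data))  # [[1, 2, 3], [4, 5, 6]]
```

tablib assigns the result back through `self._data[:]` (a slice assignment) so the Dataset’s existing list object is mutated in place rather than replaced.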

Check the if clause inside the list comprehension. If you look closely, the technique used to check for duplicate rows relies on Python’s short-circuit evaluation of Boolean operators.

Short-circuit evaluation as explained by the official docs:

Operation   Result                                  Notes
x or y      if x is false, then y, else x           Only evaluates the second argument (y) if the first one (x) is false.
x and y     if x is false, then x, else y           Only evaluates the second argument (y) if the first one (x) is true.
not x       if x is false, then True, else False    not has a lower priority than non-Boolean operators.

The remove_duplicates method uses the 1st and 3rd operations from the table above.

The key thing to remember is:

Expressions are evaluated from left to right.
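You can verify the short-circuiting behavior yourself by giving each operand a side effect and checking which ones actually run (the helper names truthy and falsy below are mine, just for illustration):

```python
calls = []

def truthy():
    calls.append("truthy")
    return True

def falsy():
    calls.append("falsy")
    return False

# `or`: the second operand is skipped when the first is true.
truthy() or falsy()
print(calls)  # ['truthy'] -- falsy() was never called

calls.clear()

# `and`: the second operand is skipped when the first is false.
falsy() and truthy()
print(calls)  # ['falsy'] -- truthy() was never called
```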

Here it is, explained with a toy example:

>>> _data = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
>>> seen = set()
>>> data_deduplicated = [row for row in _data if not (tuple(row) in seen or seen.add(tuple(row)))]

>>> print(data_deduplicated)
# [[1, 2, 3], [4, 5, 6]]

To put it into words: within the list comprehension, we iterate over the data row by row and check whether the given row is already present in the seen set. If it is not present,

tuple(row) in seen

evaluates to False, so as per the 1st operation in the table, the second argument is evaluated, which adds the row to the seen set. Crucially, set.add returns None, which is falsy, so the whole or expression is falsy, the not () condition is satisfied, and the row is added to the outer list. If the same row occurs again, it is already in the seen set, so the membership test is True, the seen.add call is short-circuited away, not True gives False, and that row is not added to the outer list. The overall result is the removal of duplicate rows while the original order is maintained.
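The trick hinges on the fact that set.add mutates the set in place and returns None, which is falsy. A quick check:

```python
seen = set()

# set.add is called for its side effect; its return value is None.
result = seen.add((1, 2, 3))
print(result)             # None -> falsy
print((1, 2, 3) in seen)  # True -> the row is now recorded

# New row:      False or None -> None (falsy), not None -> True, row is kept.
# Repeated row: True short-circuits the or, not True -> False, row is dropped.
```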

If you are more of a visual learner, the following demonstration using the Python Tutor tool - built by the outstanding academic and prolific blogger Philip Guo - should help*:

*If the IFrame below is not visible, please enable “load unsafe scripts” in your browser. Don’t worry! It is flagged as insecure only because Python Tutor is served over http rather than https.

I hope by now you have understood the short-circuit technique and the importance of reading open source code. Keep exploring, and do share your experiences with me. Thank you! :)