Oftentimes, I like to dive into open source projects to learn the best practices and design patterns that experienced programmers use to do things correctly and efficiently. Peter Norvig says as much in his famous essay Teach Yourself Programming in Ten Years:
Talk with other programmers; read other programs. This is more important than any book or training course.
I am a big advocate of this advice. This post is meant to show how reading open source code helps you identify and understand efficient patterns and coding constructs.
I admire Kenneth Reitz very much. Read and follow his The Hitchhiker’s Guide to Python! to become a great Python programmer. The lesson from that book, Reading Great Code, is the main reason I decided to read the source code of Tablib. Reading source code is daunting at first because some constructs are obscure or unfamiliar to you, which is natural. Despite such hurdles, if you keep concentrating you will find plenty of “Aha!” moments as you identify useful patterns. Here is my experience: I came across a very simple yet useful snippet that performs a common and important task in data cleaning, namely removing duplicates.
```python
def remove_duplicates(self):
    """Removes all duplicate rows from the :class:`Dataset` object
    while maintaining the original order."""
    seen = set()
    self._data[:] = [row for row in self._data
                     if not (tuple(row) in seen or seen.add(tuple(row)))]
```
It is a list comprehension guarded by an `if` clause. If you look closely at the condition, the technique used to check for duplicate rows relies on Python's short-circuit evaluation of boolean operators.
| Operation | Result                                 | Notes                                 |
|-----------|----------------------------------------|---------------------------------------|
| `x or y`  | if x is false, then y, else x          | Only evaluates y if x is false        |
| `x and y` | if x is false, then x, else y          | Only evaluates y if x is true         |
| `not x`   | if x is false, then `True`, else `False` |                                       |
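The table's behavior is easy to verify. A small sketch (the `trace` helper is mine, not from Tablib) records which operands actually get evaluated:

```python
calls = []

def trace(label, value):
    """Return value while recording that this operand was evaluated."""
    calls.append(label)
    return value

# `x or y`: x is truthy, so y is never evaluated (1st operation).
assert (trace("x", 5) or trace("y", 0)) == 5
assert calls == ["x"]

calls.clear()
# `x and y`: x is falsy, so y is never evaluated (2nd operation).
assert (trace("x", 0) and trace("y", 5)) == 0
assert calls == ["x"]

# `not x` simply returns True for a falsy x (3rd operation).
assert not 0
```

Note that `or` and `and` return one of their operands, not a bare boolean; only `not` always returns `True` or `False`.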
The `remove_duplicates` method uses the 1st and 3rd operations from the table above: `or` and `not`.
The key thing to remember is that boolean expressions are evaluated from left to right.
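Two facts make the one-liner work: operands are evaluated left to right, and `set.add` returns `None`, which is falsy. A quick sketch of both:

```python
seen = set()

# set.add mutates the set and returns None, which is falsy.
result = seen.add(42)
assert result is None
assert 42 in seen

# Left to right: `42 in seen` is checked first; since it is True,
# short-circuiting means the right operand is never evaluated.
expr = (42 in seen) or seen.add(99)
assert expr is True
assert 99 not in seen  # seen.add(99) was never called
```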
Explained with a toy example:

```python
>>> _data = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
>>> seen = set()
>>> data_deduplicated = [row for row in _data
...                      if not (tuple(row) in seen or seen.add(tuple(row)))]
>>> print(data_deduplicated)
[[1, 2, 3], [4, 5, 6]]
```
To put it into words: the list comprehension iterates over the data row by row and checks whether the given row is already present in the `seen` set. If it is not, `tuple(row) in seen` evaluates to `False`, so, as per the 1st operation in the table, the second operand is evaluated: `seen.add(tuple(row))` inserts the row into `seen` and returns `None`, which is falsy. The whole parenthesized expression is therefore falsy, the `if not (...)` condition is satisfied, and the row is added to the outer list. If the same row occurs again, `tuple(row) in seen` is `True`; short-circuiting skips the `seen.add` call, `not True` is `False`, and the row is not added to the outer list. Overall, this removes duplicate rows while preserving the original order.
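For comparison, since Python 3.7 regular dicts are guaranteed to preserve insertion order, so the same order-preserving de-duplication can be written without the short-circuit trick (a sketch of an alternative, not Tablib's actual code):

```python
_data = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]

# dict keys are unique and (since Python 3.7) keep insertion order,
# so dict.fromkeys de-duplicates while keeping first occurrences.
deduplicated = [list(row) for row in dict.fromkeys(tuple(r) for r in _data)]
assert deduplicated == [[1, 2, 3], [4, 5, 6]]
```

The `seen`-set version is still the natural fit inside Tablib's method, though, because it rewrites `self._data` in place via slice assignment.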
*If the IFrame below is not visible, please enable “load insecure scripts” in your browser. Don’t worry! It is flagged as insecure only because Python Tutor is served over http rather than https.
I hope by now you have understood the short-circuit technique and the importance of reading open source code. Keep exploring, and do share your experience with me. Thank you! :)