Auditing your data using Python

“Python” and “R” are amongst the most popular open source programming languages for data science. R was developed with statisticians in mind, whereas Python is a general-purpose programming language with an easy-to-understand syntax and a gentle learning curve. Historically, R has been used primarily for academic and research work, although anecdotal evidence suggests that it is starting to see some adoption in the enterprise world (especially in the Financial Services sector) due to its robust data visualisation capabilities. Python is widely used in enterprises because its functionality extends well beyond data analytics and because Python-based solutions are easy to integrate with many other technologies in a typical enterprise set-up.

In addition to Python and R, there is also a wide variety of very powerful commercial data analysis software. However, Python has several advantages over these commercial offerings as follows –

a) Python’s open source license (GPL compatible, but you can distribute a modified version without making your changes open source) means that it can be used for free. Commercial packages on the other hand come with licensing constraints and the associated cost factor can often limit their availability to a handful of staff in an organisation.

b) Unlike many commercial data analysis packages, Python can be used even on a low specification desktop computer, making it suitable for large scale deployment without additional investment in hardware. Data analysis code written in native Python can also be used on any computing platform and operating system that supports Python (e.g. Windows, Linux and macOS).

c) Most (if not all) commercial data analysis packages are designed for interactive use, which often makes them unsuitable for implementing fully automated and reusable data analytics solutions. Python code, on the other hand, can be used to automate the entire data analysis process, and can also be distributed and re-used without constraints.

d) The worldwide Python community is constantly adding new packages and features to its already rich set of functions. Due to the size and scale of community support, new data analysis techniques coming out of academia and research also become freely available in Python much quicker than in a commercial offering.

e) There are a number of online discussion forums dedicated to Python for knowledge sharing. The PyData conferences also provide a valuable channel for exchanging information on new approaches and emerging open source technologies for data management, processing, analytics and visualisation. Video recordings of the PyData conference proceedings are freely available on YouTube.

f) Generally speaking, there are more people with Python programming skills than with working knowledge of commercial data analysis software. Python is also gaining increasing popularity as an introductory programming language in many schools and universities world-wide. We are, therefore, very likely to see an increase in the number of people with Python programming skills in the near future.

Data analysis capabilities of Python

Python has an extremely rich set of data analysis functionalities, which in my view are more than adequate to meet even the needs of an advanced data analytics practitioner. I have summarised below some of the key data analysis capabilities of Python (you will find a number of real-life examples on how to use these capabilities in my book “Practical Data Analysis – Using Open Source Tools & Techniques”). Readers who are interested in a more comprehensive list of useful Python packages should visit the Awesome Python website, where a “curated list of awesome Python frameworks, libraries, software and resources” is maintained.

a) “Pandas” is probably one of the most important Python libraries for data analysis. Conceptually, Pandas is similar to Microsoft Excel (excluding the point and click functionality of Excel) in that it allows you to open, explore, change, update, analyse and save data in a tabular structure. However, in my opinion, Pandas is far more powerful than Microsoft Excel as a data analysis tool for the following reasons: (i) Pandas combined with the broader Python ecosystem provides a much richer set of data analysis functionality than Microsoft Excel; (ii) unlike Microsoft Excel, Pandas can load and process extremely large data files even on a low specification desktop computer; (iii) coding your data analysis logic in Pandas ensures that the same set of rules is applied every time you run the analysis; (iv) once you have written your data analysis logic in Pandas, you can re-use it to analyse a completely different data set with minimal changes; and (v) it is much easier to share a piece of code with a friend or colleague than a long set of instructions describing a series of manual operations in Excel.

b) “NumPy”, “SciPy” and “StatsModels” are Python libraries for scientific and financial computation, economic research and applied econometrics. These libraries enable Python to be used in areas such as statistics, linear algebra, Fourier transforms, signal processing, image processing, genetic algorithms, econometric analysis and much more.

c) Machine Learning is the latest buzzword amongst technologists and data scientists, and a data analysis toolkit would be incomplete without Machine Learning capabilities. Built on top of NumPy and SciPy, Python’s “Scikit-Learn” library provides a range of out-of-the-box implementations of supervised and unsupervised Machine Learning algorithms for data mining and data analysis. A more recent addition to the Python open source Machine Learning toolkit is “TensorFlow” from Google. TensorFlow is designed for large scale and computationally intensive Machine Learning (such as shallow and deep neural networks). It allows you to define in Python a graph of the computations to perform; TensorFlow then takes that graph and runs it efficiently using optimised C++ code. Other noteworthy Machine Learning libraries in Python include “Keras”, “Theano”, “Pylearn2” and “Pyevolve”.

d) “NLTK” (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides a suite of text processing libraries for classification, tokenisation, stemming, tagging, parsing and semantic reasoning, as well as wrappers for industrial-strength Natural Language Processing (NLP) libraries. “TextBlob”, a complementary Python NLP library, makes text processing simple by providing an intuitive interface to NLTK. A third Python NLP library called “Gensim” simplifies text mining, topic modelling and document similarity analysis. Last but not least is “SpaCy”, a relatively new Python NLP library featuring the speed and accuracy needed for real world use cases of NLP.

e) Data visualisation is an essential step in any rigorous data analysis process, and the Python community has left no stone unturned in making all kinds of data visualisation tools and techniques available to Python users. “Matplotlib” is the grandfather of Python data visualisation packages. It is extremely powerful, but with that power comes complexity. “Seaborn” harnesses the power of Matplotlib to create beautiful charts in a few lines of code. Seaborn was created with statistical visualisations in mind and is extremely useful when you have to plot a lot of different variables. The “Bokeh” and “Pygal” libraries offer Python users the ability to create interactive and web-ready plots. “Geoplotlib” and “Matplotlib basemap” are Python toolkits for creating maps and plotting geographical data. “Missingno” allows you to quickly gauge the completeness of a data set with a visual summary.

f) A regular time series is a sequence of data points collected at constant time intervals (e.g. the daily closing share price of a listed company for the past six months). While you can implement your own time series analysis methods using a combination of the Pandas, NumPy, Scikit-Learn, SciPy and StatsModels libraries, there are currently two off-the-shelf Python packages for time series forecasting: “Prophet”, an open source Python package created by the Facebook data science team, and “PyFlux”, which is built on top of NumPy, SciPy and Pandas. Using these libraries, one can automate the process of analysing time series data to forecast future values of the series and the degree of uncertainty associated with the forecast (e.g. tomorrow’s closing share price of a listed company, although I will caveat this statement with a word of caution: forecasting and making a prediction are two different things).

g) In April 2017, Microsoft announced a preview release of SQL Server 2017 that is capable of in-database analytics and machine learning using Python. This feature will allow Python-based data analysis models to be built inside SQL Server itself, thereby eliminating the need to move data from the database to the models. Currently, Python also provides a number of off-the-shelf packages for extracting data from different types of data sources, such as traditional databases (the full list of Python database drivers is available on Awesome MySQL website), Excel files (“Pandas”, “openpyxl” and “pyexcel” libraries), Word documents (“python-docx” and “textract” libraries), PDF documents (“PDFMiner” library), CSV files (“Pandas” library) and Internet websites (“lassie”, “micawber” and “newspaper” libraries).

h) While “Jupyter Notebook” is not itself a tool for data analysis, it is still worth listing here. Jupyter Notebook provides a rich web browser (e.g. Internet Explorer) based user interface for writing Python code, adding explanatory narrative and displaying the output of the code in a single document. Jupyter Notebook makes it extremely easy to document your data analysis work, share it with friends and colleagues, and publish it on the Internet.
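As a small illustration of item (a), here is a minimal sketch of how a data audit rule set might be coded once in Pandas and then re-run on any table with the same columns. The invoice data, the column names and the audit function are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data; in practice this might come from pd.read_csv().
records = pd.DataFrame(
    {
        "invoice_id": [101, 102, 102, 104, 105],
        "amount": [250.0, 99.5, 99.5, None, -40.0],
        "currency": ["GBP", "GBP", "GBP", "USD", "GBP"],
    }
)


def audit(df: pd.DataFrame) -> dict:
    """Return a few basic data-quality counts for a table of invoices."""
    return {
        "rows": len(df),
        # Missing values in the amount column.
        "missing_amounts": int(df["amount"].isna().sum()),
        # Rows that repeat an earlier row exactly.
        "duplicate_rows": int(df.duplicated().sum()),
        # Amounts below zero (missing values are not counted here).
        "negative_amounts": int((df["amount"] < 0).sum()),
    }


report = audit(records)
print(report)
```

Because the rules live in one function, the same checks are applied every time the audit runs, which is the point made in (iii) and (iv) above.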
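To make item (f) a little more concrete, the sketch below builds a very simple "do it yourself" forecast with Pandas: the mean of the last few observations as the point forecast and their standard deviation as a rough measure of uncertainty. This is a naive moving-average baseline, not the method used by Prophet or PyFlux, and the prices and the function name are made up for illustration.

```python
import pandas as pd

# Hypothetical daily closing prices, indexed by business day.
prices = pd.Series(
    [100.0, 101.0, 103.0, 102.0, 104.0, 106.0],
    index=pd.date_range("2019-01-01", periods=6, freq="B"),
)


def naive_forecast(series: pd.Series, window: int = 3) -> tuple:
    """Forecast the next value as the mean of the last `window` points.

    Returns the point forecast and the sample standard deviation of the
    same window as a crude indication of uncertainty.
    """
    recent = series.tail(window)
    point = float(recent.mean())
    spread = float(recent.std(ddof=1))
    return point, spread


point, spread = naive_forecast(prices)
print(f"forecast: {point:.2f} +/- {spread:.2f}")
```

A dedicated forecasting package would model trend and seasonality explicitly; a baseline like this is mainly useful as a yardstick against which to compare such models.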
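For item (g), pulling data out of a traditional database into Pandas might look like the sketch below. It assumes an in-memory SQLite database (accessed through Python's built-in sqlite3 module) as a stand-in for a real data source; the table and column names are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, payee TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO payments (payee, amount) VALUES (?, ?)",
    [("Acme", 120.0), ("Acme", 80.0), ("Globex", 45.5)],
)
conn.commit()

# Run the aggregation in the database and load the result as a DataFrame.
totals = pd.read_sql_query(
    "SELECT payee, SUM(amount) AS total "
    "FROM payments GROUP BY payee ORDER BY payee",
    conn,
)
conn.close()

print(totals)
```

Other databases would need their own driver and connection details, but the pattern of querying and loading the result into a table for analysis stays much the same.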

To summarise, if you choose to invest some of your time to develop practical data analytics skills, give Python a try!

Wednesday, January 30, 2019
