Python & Data Science

  • Accelerate - The Science of Lean Software and DevOps

    A review of key concepts I learned in the book Accelerate: The Science of Lean Software and DevOps.

  • How to Help People When You Aren't the Subject Matter Expert

    The ability to help a colleague or direct report when they’re stuck is an important part of being an engineer. Sometimes, you lack the same context or technical expertise as the person seeking help. This is a strategy I learned for being helpful in these situations.

  • Clustering Frequency Domain Data

    Unsupervised clustering algorithms can be a great way to explore any structure that is inherent to the data and perhaps not immediately obvious to the analyst.

  • Analyzing the Gold Market in Pure SQL

    As gold stages a breakout that could be one for the record books, I thought I’d take the time to write a SQL tutorial that dives deep into the price facts and behaviours of this amazing asset class.

  • Binary Tree Methods in Python

    In this post I show you a class for creating binary trees (and a cool way to display them!), as well as some methods for analyzing binary trees. Enjoy!

  • A Complete Guide to Data Structures in Python

    This is a post I’ve always wanted to put together - it’s a monster, comprehensive guide to implementing data structures in Python, complete with code, explanations, and use cases.

  • Stream Data to Google BigQuery with Apache Beam

    In this post I walk through the process of handling unbounded streaming data using Apache Beam, and pushing it to Google BigQuery as a data warehouse.

  • Train and Evaluate Machine Learning Models in Google BigQuery

    In this post I walk through one of the coolest features in Google BigQuery - the ability to train and evaluate machine learning models directly in BigQuery using SQL syntax. I’m going to import Google’s public e-commerce dataset from Google Analytics and build a machine learning model that predicts return buyers.

  • All About Logging in Python

  • All About Testing in Python

  • Create Fast, Fault-Tolerant ETL Pipelines with Google Kubernetes

    In this post I do a walk-through demonstrating how to distribute a data ingestion process across a Kubernetes cluster to achieve fast, inexpensive, and fault-tolerant data pipelines on Google Cloud Platform. This model can be used for many kinds of distributed computing - not just data pipelines! I enjoyed learning this because it cut my data processing costs for VanAurum significantly. I hope you enjoy it!

  • Building a Brain - Distributing Machine Learning Models in NoSQL

    In this post I walk through an architecture model for building better operational intelligence into VanAurum by distributing and accessing many machine learning models in MongoDB, a popular open-source NoSQL database.

  • Must-Know Methods for Python Coding Interviews

    As a self-taught developer, you end up going through (and failing) a lot of coding interviews before you get a grasp of what’s required to be successful. There is no substitute for practice, but in this post I walk through a collection of recipes I have made that address common themes I’ve noticed in interviews. Enjoy!

  • Relational vs. Non-Relational Databases

  • Analyzing the S&P 500 with PySpark

    The Spark dataframe API is moving undeniably towards the look and feel of Pandas dataframes, but there are some key differences in the way these two libraries operate. In this post I walk through an analysis of the S&P 500 to illustrate common data analysis functionality in PySpark.

  • How to Implement Bayesian Optimization in Python

    In this post I do a complete walk-through of implementing Bayesian hyperparameter optimization in Python. This method of hyperparameter optimization is extremely fast and effective compared to “dumb” methods like GridSearchCV and RandomizedSearchCV.

  • Complete Guide to Installing PySpark on macOS

    Getting PySpark set up locally can be a bit of an involved process that took me a few tries to get right. In this post I cover the entire process of successfully installing PySpark on macOS. Enjoy!

  • Analyzing the Cost Benefits of Robo-Advisors

    Robo-advisors are on a clear path to dominating the future of asset management. I was curious about how much the cost benefits of robo-advisors can affect the growth of a portfolio. In this series we perform some hypothesis testing to analyze the benefits of using a robo-advisor rather than directing your own portfolio. How significant are they?

  • Support Vector Machine Hyperparameter Tuning - A Visual Guide

    In this post I walk through the powerful Support Vector Machine (SVM) algorithm and use the analogy of sorting M&M’s to illustrate the effects of tuning SVM hyperparameters.

  • XGBoost Hyperparameter Tuning - A Visual Guide

    XGBoost is a very powerful machine learning algorithm that is typically a top performer in data science competitions. In this post I’m going to walk through the key hyperparameters that can be tuned for this amazing algorithm, visualizing the process as we go so you can get an intuitive understanding of the effect the changes have on the decision boundaries.

  • An Intuitive Walk-through of Linear Regression in Python

    In this article I do my best to explain some of the more confusing regression-related topics using something that we can all relate to - pizza! Through the course of the tutorial, I introduce regressions, the intuition behind them, how to execute them in Python, and how to interpret the results.

  • Algorithmic Portfolio Optimization in Python

    In this installment I demonstrate the code and concepts required to build a Markowitz Optimal Portfolio in Python, including the calculation of the capital market line. I build flexible functions that can optimize portfolios for Sharpe ratio, maximum return, and minimal risk.

  • Selecting Machine Learning Algorithms Part 1

    In this post I demonstrate how to build a spot-checking algorithm that can evaluate a basket of machine learning algorithms on scaled and unscaled data. By establishing a baseline performance you can then move on to forming and testing hypotheses about how transformations and parameter tweaks might affect your model performance.

Kevin Vecmanis