DataQuest.io: Data Science Foundations

2019 · Python · Pandas · Scikit-learn · PyTorch

Business Problem

Starting in data science with no formal ML background, I needed a structured path through the full pipeline: not just theory, but applied work on real datasets. Academic courses covered statistics. What I wanted was the complete journey from raw data to a working model, including data cleaning, feature engineering, modelling, and evaluation on real problems.

Solution

DataQuest's guided curriculum builds up the stack incrementally, one project per topic. Each project runs on a real dataset and nothing is abstracted away: every transformation, cross-validation split, and training loop is written by hand. The progression moves from EDA through classical models to a neural network capstone.

EDA on Hacker News. The dataset contains 300,000 posts from Hacker News. The analysis compared "Ask HN" posts (questions to the community) with "Show HN" posts (projects shared for feedback) to find which type received more comments, and which posting hour correlated with the highest engagement. Ask HN posts averaged more comments overall. Posts submitted between 15:00 and 16:00 EST generated around 38 comments on average, versus a site-wide average of 10.

Feature engineering on house price forecasting. The Ames Housing dataset has 79 variables describing residential properties in Ames, Iowa. The project focused on constructing and selecting informative features: combining basement area columns, transforming skewed numeric distributions, and encoding ordinal categories correctly. Ridge regression was the final model, with cross-validated RMSE as the target metric.

Regression on stock price forecasting. A regression model predicted the next day's closing price from lagged price and volume features. The project was an exercise in feature construction (rolling averages, lag features) and proper time-series train/test splitting. The main lesson was avoiding lookahead bias when building financial features, not producing a serious forecasting system.

Classification on fraud detection. A credit card transaction dataset with heavily imbalanced classes: fraud is rare. The project covered class imbalance handling via undersampling and class weights, threshold tuning, and why accuracy is a misleading metric when 99% of transactions are legitimate. Precision, recall, and F1 score replaced accuracy as the evaluation criteria.

Smaller projects across the curriculum also covered k-nearest neighbours, naive Bayes, and a first pass at gradient boosting.

Technical Challenges

The USPS handwritten digit capstone. In 1988, the US Postal Service digitized handwritten zip codes from envelopes processed at the Buffalo, New York sorting office. The resulting dataset contains 7,291 training images and 2,007 test images, each a 16x16 grayscale scan of a single digit (0-9). The task is the same problem mail sorting machines had to solve: identify the digit correctly so the envelope is routed to the right location.

Samples from the USPS handwritten digit dataset

The architecture is a small convolutional neural network: two Conv2d blocks, each followed by a ReLU activation and MaxPool downsampling, feeding into a fully connected classifier. The training loop tracked validation accuracy each epoch. Final test accuracy landed around 96%, without data augmentation or hyperparameter search. Others who completed this project through DataQuest reported similar results, typically in the 94-97% range depending on architecture choices and number of epochs.

How the network discovers patterns in layers. The most durable lesson from the USPS capstone was how convolutional layers discover structure hierarchically rather than memorising pixels.

Layer 1 filters learn to activate on short horizontal, vertical, and diagonal edges: the basic visual vocabulary of pen strokes. These filters look similar across any digit recognition network because edges appear in every image regardless of the task.

Layer 2 combines layer 1 outputs to detect short arcs, right-angle corners, and curve endpoints. A single curved arc filter activates for the bottom curve of a 2, the tail of a 9, and the opening of a 3.

The final convolutional layer learns shapes that map onto actual digit structure. The clearest example is the closed circular loop. Digits 0, 6, 8, and 9 all contain one as a core component.

Circular loop pattern shared by digits 0, 6, 8, 9

The digit 9 is a loop at the top with a vertical stroke down the right. The digit 8 is two loops stacked vertically. The digit 6 is a loop at the bottom with a curved tail. The network does not learn "what a 6 looks like" and "what a 9 looks like" as separate memorised templates. It learns "loop" and "vertical stroke" as reusable filters and combines them per digit.

A dense network needs a separate weight for every pixel position. A CNN shares the same filter weights across the entire image, which is both parameter-efficient and translation-invariant: the loop detector fires whether the loop appears at the top, centre, or bottom of the frame. This composability is why convolutional networks generalise well to digits they have never seen: the components are shared even when the full digit is new.

Status

Completed 2019
Foundation for all subsequent ML and production model work