Data Science Projects
Below are some projects I've worked on. Each one's been an invaluable and eye-opening learning experience. I'm looking forward to the many more to come.
Supervised Variational Autoencoders
In this group project for Columbia’s Applied Deep Learning course, my classmates and I experimented with an architecture that combines the variational autoencoder with traditional supervised learning. We found that adding variational layers to image classification architectures regularizes the model and improves classification performance.
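As a rough illustration of how a variational layer can act as a regularizer, the sketch below adds a KL-divergence penalty to an ordinary classification loss. This is a minimal sketch under stated assumptions, not the project's actual objective: the function names, the weighting parameters, and the optional reconstruction term are all hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL divergence between a diagonal Gaussian N(mu, exp(log_var)) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def cross_entropy(class_probs, label):
    """Negative log-likelihood of the true class under the predicted probabilities."""
    return -math.log(class_probs[label])


def supervised_vae_loss(mu, log_var, class_probs, label,
                        reconstruction_error=0.0, kl_weight=1.0, recon_weight=1.0):
    """Classification loss plus weighted VAE terms (hypothetical weighting scheme).

    reconstruction_error is assumed to be computed elsewhere by a decoder;
    leaving it at 0.0 gives a classifier with only the KL regularizer.
    """
    return (cross_entropy(class_probs, label)
            + kl_weight * kl_to_standard_normal(mu, log_var)
            + recon_weight * reconstruction_error)


# Toy example: a 2-D latent code and a 3-class prediction.
loss = supervised_vae_loss(mu=[0.3, -0.1], log_var=[0.0, -0.2],
                           class_probs=[0.7, 0.2, 0.1], label=0)
```

The KL term is zero only when the latent distribution matches the standard normal prior, so it pulls the latent codes toward the prior while the cross-entropy term pulls them toward being useful for classification.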
Neural Network Explainability
Neural networks, while powerful, are notorious for their opacity; they’re black boxes that produce predictions without much insight into the “reasoning” behind them. This motivates a broad field of research into Explainable AI (XAI). My summer intern project at NASA Langley Research Center examined how variational autoencoders can be modified to produce interpretable dimensionality reduction on image data, and was later published as a technical paper.
Spotify Song Lyric Analysis
This is a comprehensive analysis of almost six decades of mainstream American music. With data combined from the annual Billboard Hot 100 rankings, lyrics scraped from various web sources, and proprietary audio features curated by Spotify, the dataset was rich with possibilities. In addition to performing some traditional analysis like visualizing patterns over time or exploring various correlation structures, I found answers to questions like, "What lyrics are most typical of each decade?" The answer isn't pretty.
Extending DeepER: Deep Learning for Entity Resolution
Entity resolution is a difficult problem in data management. For example, given two datasets, one of products from Amazon and another of products from Walmart, how can we tell when both datasets are referencing the same product if they use different naming conventions and item descriptions? DeepER is a framework that leverages deep learning and word embeddings to address these issues, and this project examines various methods for extending DeepER's capabilities.
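To make the idea concrete, here is a minimal sketch of embedding-based matching: each record is represented by the average of its word vectors, and two records are treated as the same entity when their cosine similarity clears a threshold. The embedding table, the product names, and the 0.9 threshold are invented for illustration; DeepER itself uses pretrained embeddings and a learned classifier rather than a fixed cutoff.

```python
import math

# Hypothetical 3-dimensional word embeddings (real ones have hundreds of dimensions).
TOY_EMBEDDINGS = {
    "apple": [0.9, 0.1, 0.0],
    "iphone": [0.8, 0.2, 0.1],
    "smartphone": [0.7, 0.3, 0.2],
    "blender": [0.0, 0.9, 0.4],
    "kitchen": [0.1, 0.8, 0.5],
}


def embed_record(text, embeddings=TOY_EMBEDDINGS):
    """Average the vectors of all known words; unknown words are skipped."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * len(next(iter(embeddings.values())))
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def is_match(record_a, record_b, threshold=0.9):
    """Declare a match when the averaged embeddings are close enough."""
    similarity = cosine_similarity(embed_record(record_a), embed_record(record_b))
    return similarity >= threshold


same = is_match("Apple iPhone", "iPhone smartphone")      # differently worded, same product
different = is_match("Apple iPhone", "kitchen blender")   # unrelated products
```

Because the comparison happens in embedding space, records can match without sharing every word, which is what makes this approach more tolerant of differing naming conventions than exact string comparison.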
Amazon Film & TV Recommendations
This final group project for Brett Vintch's Personalization Theory course surveys how well various classic collaborative filtering and matrix factorization algorithms predict consumers' ratings of products in Amazon's Film & TV catalogue from their past ratings. We also implemented algorithms from scratch, including PLSI (probabilistic latent semantic indexing) and an approximate nearest neighbors method called cosine-based LSH (locality-sensitive hashing).
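For readers unfamiliar with cosine-based LSH, the sketch below shows the standard random-hyperplane construction: each vector is hashed to a bit string recording which side of each random hyperplane it falls on, and vectors with similar directions tend to share bits. This is an illustrative sketch, not the project's code; the function names, the toy rating vectors, and the choice of 16 bits are assumptions.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """Hash a vector to a bit tuple: one bit per hyperplane, set by the side it falls on."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of disagreeing bits."""
    disagreements = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.cos(math.pi * disagreements / len(sig_a))


def build_buckets(items, hyperplanes):
    """Group item ids by signature so candidate neighbors share a bucket."""
    buckets = {}
    for item_id, vector in items.items():
        buckets.setdefault(signature(vector, hyperplanes), []).append(item_id)
    return buckets


# Hypothetical rating vectors for three titles (one entry per user).
ratings = {
    "title_a": [5.0, 4.0, 1.0, 0.0],
    "title_b": [10.0, 8.0, 2.0, 0.0],    # same direction as title_a
    "title_c": [-5.0, -4.0, -1.0, 0.0],  # opposite direction to title_a
}
planes = make_hyperplanes(num_bits=16, dim=4)
buckets = build_buckets(ratings, planes)
sig_a = signature(ratings["title_a"], planes)
sig_c = signature(ratings["title_c"], planes)
```

Only items that land in the same bucket need an exact similarity computation, which is where the speedup over brute-force nearest neighbor search comes from. More bits give finer buckets and a tighter similarity estimate at the cost of more hashing work.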
TweetRater
This personal project, completed right before I started my master's program, examines tweets and classifies them as inoffensive, offensive, or hate speech. I compared the classification performance of a Naive Bayes model, a vanilla neural network, and a 1-D convolutional network trained on word embeddings, and found that the convolutional network did best. I also built a simple web app where you can write your own tweets and explore how the model reacts.
Likely Voter Prediction
This was my first data science project. Following the 2016 presidential election and the failure of political pollsters to accurately capture the national mood, I was curious how they go about deciding whether a respondent is a likely voter. Much to my surprise, pollsters were still using heuristics developed decades ago and the literature on using machine learning to build likely voter models was extremely thin. I trained an ensemble classifier to predict a respondent's chance of voting and assessed its performance as a likely voter model against a baseline with no such model.
Implementations of Machine Learning Algorithms
I didn't realize I loved implementing machine learning algorithms from scratch until I took John Paisley's Machine Learning class at Columbia. There's something incredibly satisfying about working out a fast, vectorized implementation of a classic algorithm. The notebooks are repurposed from the course's homework, along with a few explorations of my own.