Data Science Projects

Below are some projects I've worked on. Each one's been an invaluable and eye-opening learning experience. I'm looking forward to the many more to come.



Supervised Variational Autoencoders

In this group project for Columbia's Applied Deep Learning course, my classmates and I explored an experimental architecture that combines the variational autoencoder with traditional supervised learning. We found that adding variational layers to an image classification architecture regularizes the model and improves classification performance.

Technologies: TensorFlow, Scikit-image, Scikit-learn, ImageIO

Concepts: Explainable AI, Dimensionality Reduction, Semi-supervised Learning, Variational Autoencoder, 2D Convolutions
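
As a hedged sketch of the core idea: an ordinary convolutional classifier gets a variational bottleneck (mean, log-variance, and a sampled latent code) in place of a plain dense layer, and a KL-divergence penalty is added to the classification loss. The layer sizes and input shape below are illustrative placeholders in the TF 2 functional API, not the project's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 16  # illustrative choice

def sample(args):
    """Reparameterization trick: z = mu + sigma * epsilon."""
    z_mean, z_log_var = args
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

inputs = layers.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Flatten()(x)

# Variational bottleneck inserted into an otherwise ordinary classifier.
z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)
z = layers.Lambda(sample)([z_mean, z_log_var])

outputs = layers.Dense(10, activation="softmax")(z)
model = Model(inputs, outputs)

# The KL divergence between q(z|x) and a standard normal prior acts as the
# regularizer; the classification loss is the usual cross-entropy.
kl = -0.5 * tf.reduce_mean(
    tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
)
model.add_loss(kl)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```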


Neural Network Explainability

Neural networks, while powerful, are notorious for their opacity; they're black boxes that produce predictions without much insight into the "reasoning" behind them. This motivates a broad field of research known as Explainable AI (XAI). My summer intern project at NASA Langley Research Center investigated how variational autoencoders can be modified to produce interpretable dimensionality reduction on image data, and the work was later published as a technical paper.

Technologies: TensorFlow, Kivy, Scikit-image, Scikit-learn, ImageIO

Concepts: Explainable AI, Anomaly Detection, Dimensionality Reduction, Variational Autoencoder, 2D Convolutions
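
One common way to probe what a VAE's latent space has learned, shown here purely as an illustrative sketch (the decoder below is an untrained stand-in, not the project's model): sweep a single latent dimension while holding the others fixed, and decode each point to see what visual factor that dimension controls.

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 16

# Hypothetical stand-in decoder; in practice this is the trained VAE decoder.
decoder = tf.keras.Sequential([
    layers.Dense(7 * 7 * 64, activation="relu", input_shape=(latent_dim,)),
    layers.Reshape((7, 7, 64)),
    layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid"),
])

def traverse_latent(decoder, dim, steps=7, span=3.0):
    """Decode a sweep along one latent dimension, all others held at zero."""
    z = np.zeros((steps, latent_dim), dtype="float32")
    z[:, dim] = np.linspace(-span, span, steps)
    return decoder.predict(z)

images = traverse_latent(decoder, dim=0)
fig, axes = plt.subplots(1, len(images), figsize=(2 * len(images), 2))
for ax, img in zip(axes, images):
    ax.imshow(img.squeeze(), cmap="gray")
    ax.axis("off")
plt.show()
```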


Spotify Song Lyric Analysis

This is a comprehensive analysis of almost six decades of mainstream American music. Built from the annual Billboard Hot 100 rankings, lyrics scraped from various web sources, and proprietary audio features curated by Spotify, the dataset was rich with possibilities. In addition to traditional analyses like visualizing patterns over time and exploring correlation structures, I found answers to questions like, "What lyrics are most typical of each decade?" The answer isn't pretty.

Technologies: Spotipy, Beautiful Soup, Requests, NetworkX, Gensim, Scikit-learn, NumPy, SciPy, Pandas, Matplotlib, Seaborn

Concepts: Web Scraping, Regular Expressions, Word2Vec, t-SNE, Naive Bayes, Latent Dirichlet Allocation
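
One simple way to surface "typical" lyrics per decade, sketched under assumptions (the project's exact method isn't spelled out here): treat each decade's concatenated lyrics as one document and rank terms by TF-IDF, so a word scores highly when it is frequent in one decade but rare in the others. The tiny corpus below is placeholder data, not the real dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

decade_lyrics = {  # hypothetical stand-in corpus
    "1970s": "love peace groove dance love",
    "1990s": "heart love tonight baby baby",
    "2010s": "club money party yeah yeah money",
}

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(decade_lyrics.values())
terms = np.array(vectorizer.get_feature_names_out())

for decade, row in zip(decade_lyrics, tfidf.toarray()):
    top = terms[np.argsort(row)[::-1][:3]]  # three highest-scoring terms
    print(decade, "->", ", ".join(top))
```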


Extending DeepER: Deep Learning for Entity Resolution

Entity resolution is a difficult problem in data management. For example, given two datasets, one of products from Amazon and another of products from Walmart, how can we tell when both datasets refer to the same product if they use different naming conventions and item descriptions? DeepER is a framework that leverages deep learning and word embeddings to address these issues, and this project examines various methods for extending DeepER's capabilities.

Technologies: Keras functional API, Gensim, Argparse, Scikit-learn, NumPy, Matplotlib, Pandas

Concepts: Deep Neural Networks, Manhattan LSTMs, Siamese Neural Networks, GloVe, TF-IDF
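
A minimal sketch of the Siamese Manhattan LSTM pattern named in the concepts above, in the Keras functional API: the two records pass through a shared embedding and LSTM, and the match score is exp(-||h1 - h2||_1), which approaches 1 for likely matches. Vocabulary size, sequence length, and layer dimensions are placeholders, not the project's settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, seq_len, embed_dim = 10_000, 40, 100  # illustrative choices

left = layers.Input(shape=(seq_len,))
right = layers.Input(shape=(seq_len,))

# Shared weights: the same embedding + LSTM encode both records.
embed = layers.Embedding(vocab_size, embed_dim)
encoder = layers.LSTM(50)
h_left, h_right = encoder(embed(left)), encoder(embed(right))

# Manhattan distance squashed through exp() gives a (0, 1] match score.
malstm = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=-1, keepdims=True))
)([h_left, h_right])

model = Model([left, right], malstm)
model.compile(optimizer="adam", loss="binary_crossentropy")
```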


Amazon Film & TV Recommendations

This final group project for Brett Vintch's Personalization Theory course surveys how effectively various classic collaborative filtering and matrix factorization algorithms predict consumers' ratings for products in Amazon's Film & TV catalogue based on their past ratings. We also implemented algorithms from scratch, including PLSI (probabilistic latent semantic indexing) and an approximate nearest neighbors method called cosine-based LSH (locality sensitive hashing).

Technologies: Surprise, Hyperopt, FastFM, SciPy, Scikit-learn, Amazon Python API, IMDB Python API, OMDB Python API, NumPy, Matplotlib, Pandas, Seaborn

Concepts: Locality Sensitive Hashing, Probabilistic Latent Semantic Indexing, Factorization Machines, Bayesian Optimization, Alternating Least Squares, Unconstrained Matrix Factorization, Nonnegative Matrix Factorization, Collaborative Filtering, Gradient Boosted Trees
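
To illustrate the cosine-based LSH idea mentioned above (a sketch with random stand-in vectors, not the project's code): each item's vector is hashed to a bit signature given by the sign pattern of its projections onto random hyperplanes, and items that collide in the same bucket become approximate-nearest-neighbor candidates.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
n_items, n_dims, n_planes = 1000, 64, 16  # illustrative sizes

vectors = rng.normal(size=(n_items, n_dims))  # stand-in item vectors
planes = rng.normal(size=(n_planes, n_dims))  # random hyperplanes

# Signature: one bit per hyperplane, set when the vector lies on its + side.
bits = (vectors @ planes.T) > 0
signatures = np.packbits(bits, axis=1)

buckets = defaultdict(list)
for i, sig in enumerate(signatures):
    buckets[sig.tobytes()].append(i)  # collisions = candidate neighbors

# Approximate neighbors of item 0: items hashed to the same bucket.
candidates = buckets[signatures[0].tobytes()]
print(len(candidates), "candidates for item 0")
```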


TweetRater

This personal project, completed right before starting my master's program, examines tweets and classifies them as inoffensive, offensive, or hate speech. I compared the classification performance of a Naive Bayes model, a vanilla neural network, and a 1-D convolutional network trained on word embeddings, and found that the convolutional network performed best. I also built a simple web app where you can write your own tweets and explore how the model reacts.

Technologies: HTML, CSS, JavaScript, Flask, Keras, Scikit-learn, SpaCy, Gensim, NumPy, Matplotlib, Pandas, Seaborn

Concepts: Web Development, 1-D Convolutional Neural Networks, Word2Vec, Naive Bayes, TF-IDF, Regular Expressions, Lemmatization
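
A minimal Keras sketch of the winning approach: word indices are embedded, a 1-D convolution slides over the tweet like a learned n-gram detector, and a three-way softmax separates inoffensive, offensive, and hate speech. The vocabulary size, embedding dimension, and filter counts are placeholder values, not the tuned ones.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Embedding(input_dim=20_000, output_dim=100),
    layers.Conv1D(128, kernel_size=5, activation="relu"),  # n-gram-like filters
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(3, activation="softmax"),  # inoffensive / offensive / hate speech
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```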


Likely Voter Prediction

This was my first data science project. Following the 2016 presidential election and the failure of political pollsters to accurately capture the national mood, I was curious how they decide whether a respondent is a likely voter. Much to my surprise, pollsters were still using heuristics developed decades ago, and the literature on using machine learning to build likely voter models was extremely thin. I trained an ensemble classifier to predict a respondent's chance of voting and assessed its performance as a likely voter model against a baseline with no such model.

Technologies: Scikit-learn, NumPy, Matplotlib, Pandas, Seaborn

Concepts: Soft Voting Ensembles, SVMs, Naive Bayes, Logistic Regression, Adaptive Boosting, Recursive Feature Elimination, Social Desirability Bias, t-SNE
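
A hedged sketch of a soft voting ensemble like the one described, combining the model families named in the concepts above via scikit-learn's VotingClassifier, which averages predicted probabilities. The synthetic data stands in for the real survey features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Placeholder data standing in for survey responses (voted / didn't vote).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("svm", SVC(probability=True)),  # probability=True enables soft voting
        ("ada", AdaBoostClassifier()),
    ],
    voting="soft",  # average class probabilities across models
)
ensemble.fit(X, y)
print(ensemble.predict_proba(X[:5]))  # estimated chance of voting
```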


Implementations of Machine Learning Algorithms

I didn't realize I loved implementing machine learning algorithms from scratch until I took John Paisley's Machine Learning class at Columbia. There's something incredibly satisfying about working out a fast, vectorized implementation of a classic algorithm. The notebooks are repurposed from the course's homework, plus a few explorations of my own.

Technologies: Python, NumPy, Matplotlib

Concepts: Linear Regression, Adaptive Boosting, K-nearest Neighbors, K-means, Gaussian Processes, Markov Chains, Semi-supervised Learning
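
As a taste of the vectorized style (an illustrative sketch of one of the listed algorithms, not a copy of the notebooks): K-means where the assignment step computes every point-to-centroid distance in a single broadcasted operation, with no Python loop over data points.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # (n, k) distance matrix via broadcasting: (n, 1, d) - (k, d).
        dists = np.linalg.norm(X[:, None, :] - centroids, axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if empty.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

X = np.random.default_rng(1).normal(size=(300, 2))
labels, centroids = kmeans(X, k=3)
```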