Deep Learning

Modeling Customer Behavior

View PDF

View Presentation

View GitHub



I. Abstract

Predicting customer behavior is crucial for e-commerce platforms such as Fingerhut, both to enhance user experience and to drive business. Our team used the extensive dataset provided by Fingerhut to predict whether a customer will complete a “journey” on the company’s website, where a successful journey is one in which the customer reaches the ‘order shipped’ event stage. After careful feature engineering and data cleaning, we trained and evaluated several models, including Logistic Regression, Gradient Boosting, Neural Networks, XGBoost, and Decision Trees. XGBoost was the most effective, achieving the highest F1 score of 73%. Although we were limited by the size of the dataset and by our computing power, we believe our results are of great value to the Fingerhut team and can be leveraged to learn more about their customers’ behavior and ultimately drive business in the coming years.
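To make the target and the metric concrete, here is a minimal sketch of how a journey could be labeled and how an F1 score is computed from predictions. It is an illustration only, not the team's pipeline: the function names and every event name other than 'order shipped' are hypothetical, and the toy predictions are made up.

```python
def label_journey(events):
    """Return 1 if the journey reached the 'order shipped' stage, else 0."""
    return int("order shipped" in events)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical journeys; only 'order shipped' comes from the project description.
journeys = [
    ["view cart", "place order", "order shipped"],
    ["browse products"],
    ["place order", "order shipped"],
    ["apply for credit", "order shipped"],
]
y_true = [label_journey(j) for j in journeys]  # ground-truth labels
y_pred = [1, 1, 0, 1]                          # made-up model predictions
score = f1_score(y_true, y_pred)
```

F1 balances precision and recall, which makes it a common choice when successful journeys are much rarer than unsuccessful ones.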

Learning Equality: Curriculum Recommendations

Given topics that teachers want to teach, spanning 29 languages, I built a Siamese neural network that matches educational content to each topic from its natural-language text, framed as a multi-class, multi-label classification problem.

Link to GitHub Page

Link to Kaggle Competition Page

Abstract

The goal of this project was to predict content matches for various educational topics from across the globe. Using the natural language titles, descriptions, and other features related to each topic and content, I preprocessed the text to remove stopwords, punctuation, and extra whitespace, then vectorized it into fixed-length integer vectors using a vocabulary of 1,000,000 words, each mapped to a unique integer. After embedding and pooling those vectors, I passed each topic-content pair through an additional neural network that determines whether the two vectors are associated (a Siamese network). For each topic in the topics dataset, I compared the topic against all 150,000+ contents and selected the best matches using a threshold derived from an ROC curve.
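The preprocessing, vectorization, and threshold-selection steps can be sketched in plain Python as follows. This is a simplified illustration, not the project's code. The stopword list, the lowercasing step, the reserved ids 0 and 1, and the use of Youden's J (TPR minus FPR) to pick the ROC threshold are all assumptions. The embedding, pooling, and Siamese comparison network are omitted.

```python
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption; the real list is not given).
STOPWORDS = {"a", "an", "the", "of", "and", "to", "in"}

# Translation table that turns every punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, remove punctuation and extra whitespace, drop stopwords."""
    cleaned = text.lower().translate(_PUNCT_TO_SPACE)
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def build_vocab(token_lists, max_size):
    """Map the most frequent tokens to integers; 0 = padding, 1 = unknown."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}


def vectorize(tokens, vocab, length):
    """Convert tokens to a fixed-length integer vector (truncate or zero-pad)."""
    ids = [vocab.get(tok, 1) for tok in tokens][:length]
    return ids + [0] * (length - len(ids))


def best_threshold(scores, labels):
    """Pick the cutoff on the ROC curve that maximizes TPR - FPR (Youden's J).

    Assumes labels contain at least one positive (1) and one negative (0).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t


# Toy usage with made-up titles and made-up match scores.
tokens = preprocess("  The Area of a Circle! ")
vocab = build_vocab([tokens, preprocess("Area of a triangle")], max_size=10)
vector = vectorize(tokens + ["unseen"], vocab, length=5)
threshold = best_threshold([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

At prediction time, every topic-content pair whose score is at or above the chosen threshold would be kept as a match. This lets a topic receive several contents, in line with the multi-label framing.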