Data science and analytics projects, 2019-2023
In recent years, online customer reviews have become a major source of customer feedback. By analyzing them, restaurant owners can find ways to improve their service, or even gauge how easy or hard it is for a new store to earn good ratings and reviews in a specific location. In this project, we explore the factors that cause regional bias in restaurants’ ratings. We chose Starbucks as the target store because it is our favorite and the most popular chain store in the U.S. We are curious why the same brand, with almost the same products, receives different ratings in different locations. Given the time limit and the project scope, the analysis first focuses on the boroughs of New York City. By analyzing the online review text for Starbucks stores together with regional data (i.e., income levels), we want to answer the following two questions:
- What factors do customers care about at Starbucks?
- Do those factors vary by region? If so, do income levels explain the differences?
The results of this case study will give business owners insight into how to satisfy customers and run a successful business in each region.
- Constructed joke recommendation systems based on the similarity among users’ ratings (e.g., user-item interactions) using collaborative filtering, mean rating recommendation, autoencoders, and variational autoencoders (VAE)
- Evaluated performance using NDCG and hit rate at k = 5 and k = 10; SVD scored highest on both NDCG and hit rate (0.6)
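The two ranking metrics named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's evaluation code: it assumes binary relevance (a joke is either liked or not), and the joke IDs and function names are hypothetical.

```python
import math

def hit_rate_at_k(recommended, relevant, k):
    """Return 1.0 if any of the top-k recommended items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # The ideal ranking places every relevant item at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical data: jokes ranked for one user, and the jokes that user liked.
recommended = ["j7", "j2", "j9", "j4", "j1"]
relevant = {"j2", "j4"}

hr5 = hit_rate_at_k(recommended, relevant, 5)
ndcg5 = ndcg_at_k(recommended, relevant, 5)
print(f"hit rate@5 = {hr5:.2f}, NDCG@5 = {ndcg5:.3f}")
```

In practice both metrics would be averaged over all test users before comparing models.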
- Tobacco use has an evidence-based association with many chronic diseases, including cancer, cardiovascular disease, and mental disorders. Although combustible cigarette use has declined in the US, the percentage of users of some modern tobacco products (such as e-cigarettes) who initiated at a young age was 3 times higher in 2018 than in 2014. More attention should therefore be paid to modern tobacco products, whose prevalence is increasing.
- However, most machine learning papers focus only on traditional combustible cigarette use. This motivated us to use machine learning methods to classify an individual’s use status for modern tobacco products (never user, former user, or current user) based on their use of other tobacco products combined with sociodemographic factors including age, ethnicity, and marital status.
- Our results show that Kernel Ridge Regression performs best among the machine learning methods compared, as evaluated by MSE and AUC scores. Future work includes applying these methods to a more balanced dataset and including more information-rich features.
- Coronavirus disease 2019 (COVID-19) is a contagious disease caused by a virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It was first identified in Wuhan, China, in December 2019 and has since spread worldwide, leading to the ongoing COVID-19 pandemic[1]. Of those people who develop symptoms noticeable enough to be classed as patients, most (81%) develop mild to moderate symptoms (up to mild pneumonia), while 14% develop severe symptoms (dyspnea, hypoxia, or more than 50% lung involvement on imaging), and 5% suffer critical symptoms[2]. Older people are at a higher risk of developing severe symptoms.
- The goal of this final project is to investigate the potential patterns of COVID-19. We want to examine the following question: can COVID-19 be well modeled by SARIMA and/or SEIR models?
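For readers unfamiliar with the SEIR model mentioned above, the following is a minimal sketch of its compartment dynamics, integrated with a simple forward Euler step. The parameter values and the function name are illustrative assumptions, not values fitted in this project.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, days, dt=0.1):
    """Integrate the SEIR equations with forward Euler; return daily (S, E, I, R)."""
    n = s0 + e0 + i0 + r0  # total population, constant in this closed model
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    steps_per_day = int(round(1 / dt))
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt   # S -> E (infection)
            new_infectious = sigma * e * dt       # E -> I (end of incubation)
            new_removed = gamma * i * dt          # I -> R (recovery or death)
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        history.append((s, e, i, r))
    return history

# Assumed, illustrative parameters: transmission rate 0.5/day,
# 5-day incubation period, 7-day infectious period, 10 initial cases.
history = simulate_seir(beta=0.5, sigma=1 / 5, gamma=1 / 7,
                        s0=999_990, e0=0, i0=10, r0=0, days=120)
peak_day = max(range(len(history)), key=lambda d: history[d][2])
print("day of peak infections:", peak_day)
```

Fitting such a model to reported case counts would additionally require estimating `beta`, `sigma`, and `gamma` from data, which this sketch does not attempt.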
- Cervical cancer is one of the main fatal diseases threatening women’s health, and it usually presents no symptoms in its early stages. By the time symptoms appear, the patient’s condition may have worsened and the cancer may have become metastatic. Early identification of cervical cancer risk factors can therefore stop the disease in its tracks and reduce the mortality rate and the associated complications.
- To pursue more accurate early diagnosis, we proposed a study of cervical cancer diagnosis based on machine learning classification using a Support Vector Classifier, Kernel Ridge Regression, Logistic Regression, Lasso Regression, a Random Forest Classifier, a Gradient Boosting Classifier, and a Multilayer Perceptron. The dataset has around 30 variables describing the demographic information, habits, and medical histories of over 800 patients. It includes four targets (Hinselmann, Schiller, Cytology, and Biopsy), which serve as the measures for cancer prediction in this project. We also used the SMOTE method to handle the imbalanced dataset. Evaluation metrics include Mean Squared Error, R squared, precision, recall, F1 score, and Area Under the Curve.
- Our results show that the Multilayer Perceptron performs best for Hinselmann, and the Gradient Boosting Classifier performs best for the other three targets, when compared with the other machine learning methods. Future work includes applying these methods to a more balanced, large-scale, real-world hospital dataset and including more information-rich features.
- Tested multiple training and test filters to filter out non-answers, with a success rate above 90%
- Pre-processed text data by stemming words, removing stop words and punctuation, lowercasing, and including "tuples"
- Calculated word frequency with TF-IDF
- Explored multiple unsupervised clustering algorithms, such as K-means, affinity clustering, and GMM
- Projected the clusters into 2D space and printed the N most important terms for each cluster
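The preprocessing and TF-IDF steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the project's pipeline: stemming and "tuples" are omitted, the stop-word list is a tiny placeholder, the weighting is plain `tf * log(N / df)`, and all names and sample sentences are hypothetical.

```python
import math
import string
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "of", "to", "in", "i", "do", "not"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stop words (stemming omitted)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def top_terms(vector, n):
    """Return the n highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

# Hypothetical sample answers.
docs = ["The answer is 42.", "I do not know the answer!", "Cats sleep in the sun."]
for vector in tfidf(docs):
    print(top_terms(vector, 2))
```

The same `top_terms` idea applies to a cluster if the per-document vectors in that cluster are first averaged into a centroid; the clustering step itself is not shown here.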
Project 7: How is diabetes readmission rate related to patients’ pathologic conditions and medications?
- Constructed several GLMs and GLMMs and chose the one with the best predictive performance, relatively good inference, and scalability
- Explored the best diabetes readmission rate predictors (“readmitted”)
- Analyzed patients’ basic information, medications, and laboratory tests taken during the diabetic encounter to identify patients with worse treatment outcomes, so that they can be targeted for interventions that improve their outcomes and reduce costs through fewer readmissions
- Performed diagnostic checks
- Interpreted the final model and discussed its limitations and potential
- Applied shapes to water condition improvement methods for visual representation
- Designed and formatted the charts, and set up tool tips
- Arranged objects in containers
- Designed the layout for the mobile app and the website
Experimental design: how do the distance between the hand and the head of the chopsticks and the side of the hand affect the total time needed to pick up all the red beans in a bowl with chopsticks?
- Used a factorial design with 20 experiments in total; the distance between the thumb and the chopstick head was set to either 15 cm or 10 cm
- Took influencing factors into consideration and tried to eliminate the effect of other factors
- Constructed a linear model, interaction plot, and QQ plot, and computed confidence intervals and t-tests
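As a rough illustration of how main effects and an interaction are estimated in a two-factor, two-level factorial design like the one above, consider the sketch below. The response values are made up for demonstration and are not the experiment's data; the -1/+1 coding (10 cm vs. 15 cm, and one hand side vs. the other) and the assumption of 5 replicates per cell are also assumptions.

```python
from statistics import mean

def factorial_effects(runs):
    """Estimate main effects and the interaction for a 2x2 factorial design.

    `runs` is a list of (a, b, y) tuples with factor levels coded -1 or +1.
    Each effect is the mean response at the +1 level minus the mean at -1.
    """
    def effect(contrast):
        high = [y for a, b, y in runs if contrast(a, b) > 0]
        low = [y for a, b, y in runs if contrast(a, b) < 0]
        return mean(high) - mean(low)

    return {
        "distance": effect(lambda a, b: a),
        "hand": effect(lambda a, b: b),
        "interaction": effect(lambda a, b: a * b),
    }

# Made-up, noise-free completion times (seconds): 4 cells x 5 replicates = 20 runs.
made_up_runs = [
    (a, b, 30 + 2 * a - 3 * b + 1 * a * b)
    for a in (-1, 1)
    for b in (-1, 1)
    for _ in range(5)
]

effects = factorial_effects(made_up_runs)
print(effects)
# Each effect equals twice the corresponding made-up coefficient:
# distance 4, hand -6, interaction 2.
```

A nonzero interaction effect means the influence of distance depends on the hand side, which is what an interaction plot shows visually.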
- Design and Analysis of an Experiment
- Data visualization with the depth of the water and the sound volume
- Reconstructing artworks with Python in Grasshopper
- COVID-19: What’s all the fuss? (COVID death data 3D visualization project)
- Is systolic blood pressure related to gender, age, poverty, weight, sleep trouble, and smoking habits? An American population-based study
- Data Structures projects (https://github.com/alisamao09/data-structures)
- Deep Learning for Computer Vision (https://github.com/alisamao09/Deep-learning-for-computer-vision)
- Database Management Systems (https://github.com/alisamao09/Database-Management-Systems#database-management-systems)