
Mark S. Wang
markwang1040@outlook.com | +1 (312) 342 4692 | https://www.linkedin.com/in/shmarkwang/

EDUCATION
The University of Chicago | United States August 2019 - March 2021
Master of Science in Analytics (data science)
University College London | United Kingdom September 2016 - June 2019
Bachelor of Engineering in Mechanical Engineering with Business Finance and Economics
WORK EXPERIENCE
ShipBob Inc. | Chicago, IL Nov 2021 – Present
Data Scientist III, Advanced Analytics
• Project: CareBob, churn prediction; co-owner and sole contributor; models provide signal up to 2 months in advance,
identify ~100 at-risk merchants weekly (worth $5MM p.a.), and cover churn prediction for ~1,000 merchants ($25MM p.a.).
• Training Data: Manually labeled 600 major merchants (>100K as per stakeholders), using a modified business definition of
churn to guarantee the practical value of predictions by giving specialists time to intervene effectively.
• Model Architecture: A merchant profile model (tenure, size, etc.) and a behavior model, with predictions pooled; built on CatBoost.
• Model Performance: Achieved 80%+ precision on the test set and 98% precision in production, validated by subject matter
experts; the model is precision-optimized given the customer care team's limited capacity for outreach to customers.
• Intelligent Workflow: Built a pipeline that feeds churn risk tiers (healthy, needs attention, or at risk) to Salesforce weekly and
routes unhealthy customers to dedicated Salesforce case queues, so their cases are prioritized and churn risk is minimized.
• Reporting: Built a 5-page, weekly-refreshed Power BI report to help stakeholders visualize KPIs, customer behavior, etc.
Anthem Inc. | Chicago, IL May 2021 – Nov 2021
Data Scientist Analyst
• Project: Digital Examiner (DEX), an automated claims processing pipeline covering data ingestion, data preprocessing,
feature engineering, predictive scoring with machine learning models, prediction ranking, and postprocessing.
• Pipeline Owner: Curated documentation, reviewed changes made by project sub-teams, and merged, tested, and pushed
code to Bitbucket for each release of the project; the pipeline stands at 10K+ lines of code.
• Model Training: Improved and retrained hundreds of binary classification models as part of the project; in the August 2021
release, improved the claims processing automation rate by 77% and prediction precision by 22% for models covering
approx. 12% of all claims passing through the pipeline daily (savings in the millions), through rigorous hyperparameter
tuning, feature selection, and feature engineering. Reduced the average number of features per model from around 200 to
60, achieving much more stable model performance in production. Recognized internally at the enterprise level.
University of Chicago | Chicago, IL Jan 2021 – Mar 2022
Teaching Assistant and Grader; Courses: Machine Learning, Data Mining, Big Data Platforms, Programming for Analytics, Statistical Concepts, etc.
The Medici Group | New York, NY (Remote) July 2020 – Mar 2021
Data Scientist, part-time; converted from summer internship
• In charge of all data-related tasks; commended for navigating ambiguity and turning high-level requirements into specs in
an environment where colleagues did not come from data science, data engineering, or computer science backgrounds.
• Projects: moved transcription to an AWS speech-to-text service, saving $500K for 2021; redesigned the relational database
schema (3NF galaxy) for higher integrity and cost savings; defined new fields to collect based on customer success needs; etc.
PROJECTS
School Project – Song Popularity Prediction (Regression with NLP) January 2020 – March 2020
• Use Case: Predicted song popularity using music-related features from the Spotify API and lyrics from the Genius Lyrics API,
to help the client make faster, better decisions on song demos, given the large volume of submissions.
• Data: Final dataset contains 500K rows; the target variable, song hotness, is a float in [0, 1].
• Modelling: Built diverse base models and tuned their hyperparameters; a stacked model combining XGBoost,
Random Forest, and Elastic Net was then selected from many combinations of base models, significantly improving results.
• Feature Engineering: Imputed, dropped, and one-hot encoded data where necessary using domain knowledge, e.g. time
signature and tempo distribution by genre; MAE reached 0.185, versus 0.201 when training with .dropna(). Engineered
features from lyrics using Latent Dirichlet Allocation and word2vec; used alone, LDA reduced MAE to 0.161 and was chosen
over word2vec, which gave weaker results and showed signs of overfitting.
• Reporting: The genre feature has importance > 0.80 in both XGBoost and Random Forest; to gain insight into making a hit
song in each genre and drive song-editing actions, constructed models with data split by genre. Models with less data show
high variance, but where a genre has more than 100K rows, results average out to an MSE of 0.038 and an MAE of 0.153.
SKILLS & INTERESTS
Singapore National Youth Orchestra | Singapore Feb 2012 – Dec 2015
Member/Cello Section Leader in Concerts
• Performed in biannual concerts with 7 hours of rehearsal per week; for high commitment and skill, selected to perform at state
events, e.g. the Lying-in-State of the founding Prime Minister of Singapore and the Ministry of Education Annual Awards Ceremony.
Skills: Python, R, Linux, SQL; MS Azure, Databricks, GCP, AWS; Tableau, Power BI; Hadoop, Hive, Spark, etc.; gensim, Keras,
scikit-learn, H2O, CatBoost, LightGBM, etc.
