
ROLE AND RESPONSIBILITIES

• Building data pipelines to bring together information from different source systems using Python and NodeJS
    • Data pipeline (a cleaning sketch follows this list):
        • Extract raw data from different sources to the data warehouse
        • In the staging area, transform the data (clean it, cast it to the right data types, check NULLs (remove or fill), check duplicates, check invalid data values, ...)
        • Load into the analytics area
        • Maintain the data pipeline, automate tasks
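A minimal sketch of that staging-area transform step in pandas; the table and column names are hypothetical stand-ins, not a real pipeline.

```python
import pandas as pd

# Tiny stand-in for a raw extract; real pipelines would read from the sources.
raw = pd.DataFrame({
    "order_id": [1, 1, 2, None, 3],
    "order_date": ["2021-01-05", "2021-01-05", "bad-date", "2021-02-01", "2021-02-03"],
    "price": ["10.5", "10.5", None, "7.0", "-3"],
})

# Put columns into the right data types.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

# Check NULLs: drop rows missing the key, fill the rest.
raw = raw.dropna(subset=["order_id"])
raw["price"] = raw["price"].fillna(raw["price"].median())

# Check duplicates and invalid data values.
raw = raw.drop_duplicates(subset=["order_id"])
raw = raw[raw["price"] >= 0]

# Load into the analytics area (here just a cleaned file).
raw.to_csv("orders_clean.csv", index=False)
```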
• Integrate, consolidate and cleanse data and structure it for use in analytics applications
    • Data schema of the email marketing data (a join sketch follows this list):
        • 3 tables: receivers, products, emails
        • Receivers: name, age, job, income, ...
        • Products: the product we market for – industry, type of customers, number of customers each month, price of the product
        • Emails: content, subject, kind of campaign
        • Click: click-through rate, conversion rate, device, position, gender, ...
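A small pandas sketch of joining these tables for analytics; the keys and columns (email_id, receiver_id, clicked, campaign_kind) are assumptions, not the actual schema.

```python
import pandas as pd

# Tiny stand-ins for the tables described above; keys are hypothetical.
receivers = pd.DataFrame({"receiver_id": [1, 2], "age": [30, 45], "job": ["dev", "sales"]})
emails = pd.DataFrame({"email_id": [10, 11], "campaign_kind": ["promo", "newsletter"]})
clicks = pd.DataFrame({"email_id": [10, 10, 11], "receiver_id": [1, 2, 1],
                       "clicked": [1, 0, 1]})

# Join click events to the email and receiver attributes.
df = (clicks
      .merge(emails, on="email_id", how="left")
      .merge(receivers, on="receiver_id", how="left"))

# Example metric: click-through rate per kind of campaign.
print(df.groupby("campaign_kind")["clicked"].mean())
```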
• Analyze incoming data sets using Python Pandas
    • Assignment 2 - Influence on Job Satisfaction (FINAL)
        • Check data: dtypes, info, describe, drop -> pandas_profiling
        • Fill missing values with the mean or 0, ...; remove outliers
        • Visualize: matplotlib, seaborn
        • Preprocessing: label encoding, one-hot encoding
        • Hypothesis testing (a sketch follows this list):
            • t-test
            • ANOVA
            • chi-square
            • correlation
            • Measurement levels: nominal -> ordinal -> interval -> ratio
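A sketch of those tests with scipy.stats, run on synthetic stand-in survey data (the column names are illustrative, not the assignment's real ones).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the survey data used in the assignment.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "satisfaction": rng.normal(6, 2, 200),
    "income": rng.normal(50_000, 10_000, 200),
    "gender": rng.choice(["male", "female"], 200),
    "job": rng.choice(["dev", "sales", "hr"], 200),
    "quit_intent": rng.choice(["yes", "no"], 200),
})

# t-test: compare mean satisfaction between two groups.
a = df.loc[df["gender"] == "male", "satisfaction"]
b = df.loc[df["gender"] == "female", "satisfaction"]
print(stats.ttest_ind(a, b, equal_var=False))

# ANOVA: compare means across more than two groups.
print(stats.f_oneway(*[g["satisfaction"] for _, g in df.groupby("job")]))

# Chi-square: test independence of two categorical variables.
print(stats.chi2_contingency(pd.crosstab(df["job"], df["quit_intent"])))

# Correlation between two numeric (interval/ratio) columns.
print(stats.pearsonr(df["income"], df["satisfaction"]))
```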
    • Assignment 6 - Clustering Boston Marathon groups (FINAL)
        • K-means (with and without PCA; elbow method to find a low sum of squared distances; silhouette_score – greater is better) – but only for numerical attributes (see the sketch after this list)
        • K-prototypes (kmodes) (with and without PCA) – for categorical attributes
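A sketch of K-means with PCA, the elbow method and silhouette_score in scikit-learn, on random stand-in data for the marathon features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Random stand-in for numeric runner features (pace, age, splits, ...).
X = StandardScaler().fit_transform(np.random.default_rng(0).random((500, 6)))

# Optionally reduce dimensions first with PCA.
X2 = PCA(n_components=2).fit_transform(X)

# Elbow method: inertia (sum of squared distances) per k,
# plus silhouette_score (greater is better).
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X2)
    print(k, km.inertia_, silhouette_score(X2, km.labels_))
```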
    • Assignment 5 - Predict airline arrival delay (FINAL)
        • Tree-based models: label encode; non-tree-based: one-hot encode
        • Logistic regression – fits 1 and 0, what is likely
        • Naïve Bayes – likelihood
        • Decision tree – label, find which decision – can overfit
        • Random forest – lots of decision trees
        • Gradient boosting – combines a lot of weak classifiers (see the sketch after this list)
            • LightGBM
            • XGBoost
        • Support vector machine: finds the boundary from the vectors with the least distance (the support vectors)
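A sketch of the label-encode-then-tree-model recipe with scikit-learn on synthetic stand-in flight data; LightGBM/XGBoost would slot in where the scikit-learn ensembles are.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the flight data; the columns are illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "carrier": rng.choice(["AA", "DL", "UA"], 500),
    "distance": rng.integers(100, 3000, 500),
    "dep_hour": rng.integers(0, 24, 500),
    "delayed": rng.integers(0, 2, 500),
})

# Tree-based models can take label-encoded categories directly.
df["carrier"] = LabelEncoder().fit_transform(df["carrier"])

X, y = df[["carrier", "distance", "dep_hour"]], df["delayed"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest = many decision trees; gradient boosting = many weak
# classifiers combined.
for model in (RandomForestClassifier(), GradientBoostingClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```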
    • Assignment 4 - House price prediction with Linear Regression, Polynomial Regression and PCA (FINAL)
        • Linear regression (find a function of order 2 to fit) with cross-validation (cross the train and test sets) and GridSearchCV (choose how many parts to cross)
        • Linear regression with polynomial regression (order-3 function)
        • Linear regression with PCA (reduce dimensions)
        • Linear regression with PCA and polynomial regression (see the sketch after this list)
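A sketch of those variants as one scikit-learn pipeline searched with GridSearchCV; random regression data stands in for the assignment's house-price table.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Random stand-in data; the real house-price table would slot in here.
X, y = make_regression(n_samples=300, n_features=8, noise=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),                 # reduce dimensions
    ("poly", PolynomialFeatures()), # polynomial features
    ("reg", LinearRegression()),
])

# GridSearchCV: cv=5 means the data is crossed in 5 parts.
grid = GridSearchCV(pipe, {
    "pca__n_components": [2, 4, 6],
    "poly__degree": [1, 2, 3],      # degree 1 = plain linear regression
}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```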
• Work on the automation of data-flows using Python
    • Extract data from APIs by installing the requests library
    • Build pipelines with PySpark (a sketch follows this list):
        • Create a Spark session
        • Read tables from the database using Spark JDBC
        • Define transform and aggregate functions, similar to pandas functions
        • Define a main function
        • Use Airflow to schedule the job
• Drive data management improvements through features/stories
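A minimal PySpark sketch following those steps; the JDBC URL, table name and credentials are placeholders, not a real configuration.

```python
from pyspark.sql import SparkSession, functions as F

def transform(df):
    # Aggregate in a pandas-like style: average delay per carrier.
    return df.groupBy("carrier").agg(F.avg("delay").alias("avg_delay"))

def main():
    spark = SparkSession.builder.appName("pipeline").getOrCreate()

    # Read a table from the database over JDBC (placeholder connection).
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://host:5432/db")
          .option("dbtable", "flights")
          .option("user", "user").option("password", "***")
          .load())

    transform(df).write.mode("overwrite").parquet("out/avg_delay")
    spark.stop()

if __name__ == "__main__":
    main()  # in production, an Airflow DAG would schedule this script
```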
QUALIFICATIONS AND EDUCATION REQUIREMENTS

• Fluent in conversational English: 7.5 IELTS
• Good understanding of ETL tools and REST-oriented APIs for creating and managing data integration jobs
    • ETL tools: experience with the Hadoop ecosystem and Pentaho (integrated with Hadoop)
        • Create and maintain Nifi tasks (get raw data from sources such as Oracle, FTP and APIs; update it and put it into the raw zone in HDFS) and Pentaho tasks (get data from HDFS into a temporary table, transform it, output it with the right types to the official table)
    • REST-oriented APIs: use APIs to get data in Nifi
        • Basic auth: the client sends user, password and grant type to the server -> if correct, the server sends an access token to the client
        • Use this access token (bearer <access token>) together with a generated signature, send it to the server -> get the data content from the server (a sketch of this flow follows)
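A sketch of that token flow with the requests library; the endpoints, field names and credentials are all hypothetical.

```python
import requests

# Hypothetical endpoints and field names; real APIs vary.
AUTH_URL = "https://api.example.com/oauth/token"
DATA_URL = "https://api.example.com/v1/data"

# Step 1: send user, password and grant type; receive an access token.
resp = requests.post(AUTH_URL, data={
    "username": "user", "password": "***", "grant_type": "password",
}, auth=("client_id", "client_secret"))
token = resp.json()["access_token"]  # assumed response field

# Step 2: send the bearer token with each data request.
data = requests.get(DATA_URL, headers={"Authorization": f"Bearer {token}"})
print(data.status_code, data.json())
```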
        • Use the Yahoo Finance API with Python to get FOREX data: simple Python code: import the library -> call .download(forex code, start date, end date, options), as sketched below
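A sketch of that call, assuming the yfinance package (one common Yahoo Finance client library); the currency pair is illustrative.

```python
import yfinance as yf  # one common Yahoo Finance client library

# Download FOREX data by ticker code with a start and end date;
# "EURUSD=X" is an illustrative pair.
fx = yf.download("EURUSD=X", start="2021-01-01", end="2021-06-30")
print(fx.head())
```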
• Some experience using SQL or similar data processing tools: use Pentaho for data processing; use DBeaver to connect and to write and test code
• Experience in process automation using MongoDB, MySQL, Python, NodeJS: also know PySpark
• Experience in AWS, GCP, data warehouses and data lakes is a plus: get raw data and put it in the data lake, then extract and transform the data into the data warehouse for the DA team
• Analytic/data-oriented mindset: have experience as a data analyst (teaching assistant) and as a data scientist (internship in FOREX trading); have some projects using Pandas with scikit-learn to analyze data with machine learning
• Good team worker, service oriented towards internal clients
REST

• REST: REpresentational State Transfer; a set of functions such as GET (read), PUT (update), POST (create) and DELETE (delete), equivalent to CRUD; uses HTTP methods addressed by URL to communicate (see the sketch after this list)
• API: Application Programming Interface - a contract between the consumer of information (calling the API) and the provider (the response); returns JSON or XML
• RESTful API: a standard for designing APIs for web applications to manage resources (files, pictures, audio, video, ...)
• Application -> API -> Client
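A sketch of the CRUD-to-HTTP mapping with the requests library, against a placeholder resource URL.

```python
import requests

BASE = "https://api.example.com/items"  # placeholder resource URL

# REST maps CRUD onto HTTP methods addressed by URL:
requests.post(BASE, json={"name": "a"})        # create
requests.get(f"{BASE}/1")                      # read
requests.put(f"{BASE}/1", json={"name": "b"})  # update
requests.delete(f"{BASE}/1")                   # delete
```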
NodeJS

• JavaScript runtime used to build web applications
• Fast processing with non-blocking I/O, handling thousands of connections at one time
• Scalable

Audience Serv:

• Winner of the Deloitte Technology Fast 50 Germany 2020 & 2021
• 6 offices, headquarters in Berlin
• 12 years of experience, since 2008
• 23 markets
• Provides marketing solutions (especially email), using ML and AI
• Mission: exceed client expectations of both service levels and campaign outcomes

API

• Use Flask or FastAPI to write an API (a minimal sketch follows this list)
    • Get the library
    • Define the class
    • @app.get(<address>): define what to return
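A minimal FastAPI sketch of those steps; the route and payload are illustrative.

```python
from fastapi import FastAPI

app = FastAPI()

# @app.get(<address>) registers a handler that defines what to return.
@app.get("/items/{item_id}")
def read_item(item_id: int):
    return {"item_id": item_id}

# Run with: uvicorn main:app --reload  (assuming this file is main.py)
```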
• Authentication: JWT
    • Send a POST request with username and password to the server
    • The server creates a JWT and sends it back (header – type and crypto algorithm; payload – data in the JWT registry: reserved, public and private claims; signature – a unique string that ensures the JWT's integrity)
    • The client sends the JWT token with every request; if it is valid, the client gets the data
    • Status: 200 – OK, 401 – invalid (see the sketch below)
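A sketch of that JWT flow, assuming the PyJWT package; the secret and claims are placeholders.

```python
import jwt  # PyJWT; one common library for this flow

SECRET = "change-me"  # placeholder signing key

# Server side: after checking username/password, create and sign a JWT.
# The header (type, crypto algorithm) is set by the library; the payload
# carries the claims; the signature keeps the token's integrity.
token = jwt.encode({"sub": "user42"}, SECRET, algorithm="HS256")

# Client sends the token with every request; the server validates it.
try:
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    print(200, claims)           # 200 – OK
except jwt.InvalidTokenError:
    print(401, "invalid token")  # 401 – invalid
```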
