
Information Technologies at Finance

BIG DATA ANALYTICS


M. Hamdi Özçelik
Marketing Analytics and Optimization Manager
Fall 2018
YAPIKREDİ'S INTELLECTUAL AND INDUSTRIAL PROPERTY RIGHTS
Confidentiality:
The parties (University, YapıKredi, Students) are responsible for the following with respect to any confidential information (*) disclosed to them by another party:
• Protecting it,
• Not giving it to any third party in any manner and/or not making it public,
• Not using it, directly or indirectly, for purposes other than those of the educational relationship between them.
• The parties may give this information, only when necessary and as required by their work, to subordinate employees and other persons working under them who need to know it; in doing so they warn these employees and persons about the confidentiality of the information and assume responsibility for their subordinates.
Rights:
• The intellectual and industrial property rights to all material used by the YapıKredi instructor for the training (presentations, demos, software code, publications) belong to YapıKredi A.Ş., unless another ownership is stated or the material is in the public domain.
• Reproducing the training content (given or presented to students), sharing it in other media, or publishing work based on it requires YapıKredi's permission.
• Before work that benefits from training given by the Bank is used in publications such as articles or conference presentations, approval must be obtained from the YapıKredi Intellectual Property Rights Officer using the Publication Approval Form.
• The parties accept that Turkish law applies to the resolution of any dispute that may arise, that the Istanbul Courts and Enforcement Offices and the courts and enforcement offices at the location of the Bank's Head Office have jurisdiction, and that the jurisdiction of the courts and enforcement offices authorized by law is reserved.
(*): The definition of "Confidential Information" covers every kind of information and document, including: all information, software and software code, documents, and information on licenses and services that belongs to the BANK or to the BANK's own customers and/or personnel, that concerns any commercial, financial, technical or similar matter the BANK owns, will own or is engaged in, and that is conveyed to the parties or may be learned by them orally, in writing, on magnetic media or in any other way; every kind of magnetic tape, cartridge, document, manual, specification, circular, program listing and data file; the BANK's licenses and services, whether in effect or not yet publicly announced; every service the BANK is engaged in, together with its discovery, invention, research, development, production and sale; and processes and general commercial activities, sales costs, profit, pricing methods, and organization and personnel lists.

2
Instructor
M. Hamdi Özçelik
– Marketing Analytics and Optimization Manager at YKB Retail Banking
– Portfolio Analytics and Optimization Manager at YKB Retail Banking
– Datamining & Analytical Applications Manager at YKB IT
– Innovation & R&D project leader at YKB IT
– Worked as research assistant, sw developer, project manager, sw development
manager, part-time lecturer

– PhD student at Marmara University Industrial Engineering


– M.Sc. from Boğaziçi University Industrial Engineering, 1993
– B.Sc. from Boğaziçi University Electrical & Electronics Eng., 1990

– Expertise in data mining, optimization, analytical CRM, software development, metaheuristics

hamdi.ozcelik@yapikredi.com.tr

3
WHAT AGE ARE WE LIVING IN?

• Industrial Age ?
• Computer Age ?
• Internet Age ?
• Data Age ?
• Information Age ? Knowledge Age ?
• Communication Age ? Mobile Age ?
• Innovation Age ?

4
VIDEO 1: EXPONENTIAL TIMES

5
SOURCES OF BIG DATA
• The big data challenge comes with its four main dimensions (*)
– Volume
– Velocity
– Variety
– Veracity
• Tsunami of information comes from
– Internet of Things (IoT)
– Unstructured data
– Social Media
– Customer data
– External sources

(*) http://www-01.ibm.com/software/data/bigdata/

6
AGE OF DIGITAL CUSTOMER

• Customers
– Easy access to info
– Digital footprints
• Banks
– New channels
– New data sources

7
HADOOP and MAPREDUCE
Hadoop Data Warehouses
• HDFS: Hadoop Distributed File System
• Not common in most corporations; found mainly at businesses like Google and Amazon
• Used for log parsing, indexing, batch processing and, most importantly, storing huge volumes of unstructured data
• Applications are mainly in web log analysis, visitor behavior, image processing, search indexes, analyzing and indexing textual content, and research in natural language processing

MapReduce
• A new programming model/paradigm/architecture/framework
• Provides automatic parallelization and distribution, so massively parallel processing of large datasets across many nodes becomes feasible
• Different from relational databases
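
The MapReduce model above can be illustrated with a word count. The following is a minimal single-process sketch, not Hadoop's API: the function names map_phase, shuffle and reduce_phase and the sample lines are assumptions made for this example. On a real cluster the map and reduce calls would run in parallel on many nodes, with the framework handling the shuffle.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample input standing in for a large web log
sample_lines = ["login view view", "view logout", "login view"]
counts = word_count(sample_lines)
```

Because each line is mapped independently and each key is reduced independently, both phases can be distributed without changing the logic.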
YARN
• A resource management framework for scheduling and handling resource requests from distributed
applications.

8
BIG DATA – BIG OPPORTUNITY
• Paradigm shift: "lots of data as a problem" to "lots of data as an opportunity".

• Data is the new "currency of competition"

• Firms that adopt data driven decision making (DDD) have output and productivity that is
5-6% higher than what would be expected given their other investments and information
technology usage (*).

• Without data, analysis is not possible


• Without analysis, DDD is not possible

(*) Brynjolfsson, Erik, Hitt, Lorin M. and Kim, Heekyung Hellen, Strength in Numbers: How Does Data-Driven Decision Making Affect Firm Performance? (April 22, 2011).
http://ssrn.com/abstract=1819486 or http://dx.doi.org/10.2139/ssrn.1819486

9
BIG DATA
• Volume
– Distributed file system (Hadoop)
– In-database analytics
• Variety
– Sandboxes
– Information discovery
– Structured data: mostly internal, mostly offline, known identity, own channels, transactional, credit bureau, demographics, financial behavior
– Unstructured data: external, mostly online, may be anonymous, social media, interactions, location data, every behavior
• Velocity
– Real-time decisions
– Complex event processing
• Veracity
– Data governance: privacy, security, quality, metadata

Targeted Actions

BIG VALUE
FROM DATA TO VALUE

Diagnose -> Analyze & Validate -> Synthesize -> Use

• Diagnose: Expert opinions, Data-driven insights
• Analyze & Validate: Hypothesis testing, Justification
• Synthesize: Datafication, KPIs
• Use: Innovative, Cross-domain

11
VIDEO 2: EXPLAINING BIG DATA

12
WHAT IS A DECISION?

13
KINDS OF DECISIONS

14
COMPARISON OF DECISION TYPES

15
BUSINESS ANALYTICS

«Data Poor, Insight Rich» is much better than «Data Rich, Insight Poor» !!

16
BIG DATA – BUSINESS ANALYTICS
• Business Analytics = Extracting value from the new digital wealth surrounding us.
• The ease of capturing big data’s value, and the magnitude of its potential, vary
across sectors:

Source: McKinsey Global Institute

17
HIGH PERFORMANCE ANALYTICS - HPA

• Grid computing: distribute the workload among several computing engines (optimize CPU utilization)

• In-database analytics: move the analytics process closer to the data (no data duplication, minimize data integration efforts)

• In-memory analytics: distribute the workload and data alongside the database (minimize disk I/O)

18
VIDEO 3: BIG DATA WILL CHANGE OUR WORLD

19
DATA SCIENCE

Data Science is the art of turning data into actions.

A Data Product provides actionable information without exposing decision makers to the underlying data or analytics. Examples include:

• Movie Recommendations
• Weather Forecasts
• Stock Market Predictions
• Production Process Improvements
• Health Diagnosis
• Flu Trend Predictions
• Targeted Advertising

20
INVENTING USE CASES: CREDIT CARD BEHAVIOR
Customer behavior detected: The customer uses her credit card only in the first half of the statement period

Diagnose -> Analyze & Validate -> Synthesize -> Use Case 1 / Use Case 2

• Diagnose: Behavioral pattern
• Analyze & Validate: Credit bureau data
• Synthesize: Metric
• Use Case 1: Detect fraud
• Use Case 2: Sell another card

21
USING DATA = PLAYING WITH SAND

dirty simple art

22
IMPORTANT ISSUES WITH BIG DATA ANALYTICS
1) OVERENGINEERING

2) ANALYTICAL TALENT NEED

3) PARALYSIS BY ANALYSIS

4) ANALYSIS IS NOT ENOUGH

5) DATA WASTE

23
ISSUE 1: OVERENGINEERING
Example (figure): Weight: 220 tons, Speed: 13 km/h, Mileage: 62 km, Armament: 32 rounds, Fuel tank: 2,700 lt

• Results in
– Overcomplexity
– Higher costs
– Low usability
– High maintenance costs

• Design principles
– “Simple is best”
– “Less is more”
– “Balanced” design
– Design as simple as possible, but not simpler

• Brute force <-> Engineering approach <-> Overengineering

• Engineering: Value creation by technology

"That’s been one of my mantras: focus and simplicity. Simple can be harder than complex: You
have to work hard to get your thinking clean to make it simple. But it’s worth it in the end because
once you get there, you can move mountains." Steve Jobs

24
DESIGNING THE SIMPLE ONE
When you first start off trying to solve a problem, the first solutions
you come up with are very complex, and most people stop there. But if you
keep going, and live with the problem and peel more layers of the onion off,
you can often times arrive at some very elegant and simple solutions.
Steve Jobs

25
ISSUE 2: ANALYTICAL TALENT NEED

• Overall US demand for information services is expected to exceed $600 billion by 2015 (*).

• In the United States alone, research (**) shows that the demand for people with deep analytical skills in big data could outstrip current projections of supply by 50 to 60 percent.
– By 2018, as many as 140,000 to 190,000 additional specialists will be required.
– An additional 1.5 million managers and analysts who have a “sharp” understanding of “how big data can be applied” are also needed.

(*) Accenture
(**) McKinsey Quarterly

26
HOW TO BE AN ANALYTICAL TALENT?
• Multidisciplinary education/training in areas such as:
– Database
– Programming
– Mathematics
– Statistics
– Operations research
– Sociology
– Psychology
– Engineering Economics
– Behavioral Economics
– Marketing
– Finance
– Artificial Intelligence
– Neuroscience
– Game Theory
– Machine Learning
– ….

27
DATA SCIENTIST

• Curiosity
• Creativity
• Focus
• Attention to detail

Skill areas (figure): Domain Expertise, Computer Science, Math

Information Technologies (IT) needs to spend:
• more time on the «I»
• less on the «T».

New IT people should be «I-shaped», not «T-shaped»!

28
ISSUE 3: PARALYSIS BY ANALYSIS

• It occurs when the cost of decision analysis exceeds the benefits that could be gained by enacting some decision.

• It is more common with:
– Big data, information overload
– Dirty data, chaotic environment
– Non-deterministic situations
– Lack of experience in value creation
– Poor IT / business alignment & bureaucracy

• The cures:
– Good prioritization
– Using stochastic methods
– Using simpler methods
– Understanding the business, the customer and the data
– Common goal: value creation
– Doing more than “the analysis”!

29
ISSUE 4: ANALYSIS IS NOT ENOUGH

• Doing analysis is not enough. Analytic propositions are not useful alone.

• To create business value, synthetic propositions should be created. Immanuel Kant called them “augmentative judgments” (*):

“For although analytical judgments are highly important and necessary, they are so, only to arrive at that clearness of conceptions which is requisite for a sure and extended synthesis.”

• Synthesizing every bit of information into useful models is an art!

“All models are wrong, some are useful” – George Box

(*) “The Introduction to the Critique of Pure Reason” (1781/1998, A6-7/B10-11)

30
ISSUE 5: DATA WASTE
It occurs when we do not fully utilize the wealth of data that we already have.

The main reasons are:
• There are more disparate systems in place now than ever before
• Systems are being used in richer, more diverse ways
• Total data volume has grown exponentially
• The volume of unstructured data has grown especially fast

Data waste means:
• Missed opportunities
• Weaker customer relationships
• Poor strategic and tactical decisions
• Legal and regulatory risk
• Higher operating costs

The way to tap this value is called «information optimization». It should address both the accessibility of data and its use in analytics.

31
VIDEO 4: BIG DATA FOR SMARTER CUSTOMER EXPERIENCES

32
DEEP LEARNING

• Artificial intelligence mimics people’s cognitive activities such as thinking, analysing, and decision making.

• Alan Turing is widely regarded as the originator of artificial intelligence. He opened the discussion on artificial intelligence with an interesting question: “Can machines think?”

• Deep learning is widely used in the IT industry for applications such as computer vision, sound recognition, and natural language processing. Google, Microsoft, and Apple are some of the leading companies in deep learning applications.

• Artificial neural networks explore relations in the data by composing basic functions and building new layers.

• Deep learning is an artificial neural network with a large number of layers; it starts recognizing the pattern from the lowest layer upward and forms a prototype.
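
The idea of composing basic functions into layers can be sketched as a forward pass through a tiny network. This is an illustration only: the layer sizes and the hand-picked weights are assumptions, nothing is trained here, and real deep learning work would rely on a dedicated library.

```python
import math

def sigmoid(x):
    """A basic non-linear function applied to each unit."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(inputs, weights, biases):
    """One layer: a weighted sum of the inputs followed by a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Compose the layers: the output of one layer feeds the next."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]
output = forward([1.0, 2.0], layers)
```

A deep network follows the same pattern with many more entries in layers; learning consists of adjusting the weights, which this sketch leaves out.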

33
MODELING – SIMPLE START
Selection Rules
Prior Probability
The ratio of the number of elements with a desired property to the number of elements in the whole set.
Example: Let's assume we have 10 customers and 4 of them have casco (comprehensive motor) insurance. The ownership ratio is 40%. In other words, if we select a customer at random, the probability that the selected customer has casco insurance is 40%.

Selection Rule
A selection rule divides the set into two subsets. The average rate in each subset can differ from that of the whole set. Such rule-based splitting can be illustrated by a decision tree with branches and leaves, or shown in a decision table:

Car ownership   Count   Casco owners   Ownership
1               6       4              67%
0               4       0              0%
Total           10      4              40%

All customers: # customers = 10, # casco owners = 4, ownership = 40%
• Car owners: # customers = 6, # casco owners = 4, ownership = 67%
• Not car owners: # customers = 4, # casco owners = 0, ownership = 0%
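
The car-ownership rule can be reproduced in a few lines. This is a minimal sketch: the list of tuples is just one way to encode the 10 customers of the example, and the helper names ownership_rate and split_by_rule are assumptions introduced here.

```python
# The 10 customers of the example as (car_owner, has_casco) pairs
customers = [(1, 1)] * 4 + [(1, 0)] * 2 + [(0, 0)] * 4

def ownership_rate(rows):
    """Share of rows that have casco insurance."""
    return sum(casco for _, casco in rows) / len(rows)

def split_by_rule(rows, rule):
    """Divide the set into the rows that satisfy the rule and the rest."""
    selected = [r for r in rows if rule(r)]
    rest = [r for r in rows if not rule(r)]
    return selected, rest

overall_rate = ownership_rate(customers)
car_owners, others = split_by_rule(customers, lambda r: r[0] == 1)
rate_car_owners = ownership_rate(car_owners)
rate_others = ownership_rate(others)

print(f"{overall_rate:.0%} {rate_car_owners:.0%} {rate_others:.0%}")
# prints: 40% 67% 0%
```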
Selection criteria turn the prior probability into a posterior probability:

prior probability -> Selection (select criteria) -> posterior probability

\[ \text{Lift} = \frac{\text{posterior probability}}{\text{prior probability}} \]

Lift shows the density of the selected subset in terms of ownership relative to the overall average. If the selection is random, the ownership is the same as the overall average, so the lift is 1. Lift therefore shows how concentrated the selected audience is with respect to that feature. It is at least 0, and its maximum value is max_lift:

\[ \text{max\_lift} = \frac{1}{\text{prior probability}}, \qquad 0 \le \text{Lift} \le \text{max\_lift} \]

Gain shows how much density has been created in addition to the overall average:

\[ \text{Gain} = \text{Lift} - 1, \qquad -1 \le \text{Gain} \le \text{max\_lift} - 1 \]
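
As a worked example, the definitions can be applied to the car-owner selection (prior 4/10, posterior 4/6). The function names below are assumptions chosen to mirror the formulas.

```python
def lift(posterior, prior):
    """Density of the selected subset relative to the overall average."""
    return posterior / prior

def max_lift(prior):
    """Upper bound of lift, reached when the posterior probability is 1."""
    return 1.0 / prior

def gain(posterior, prior):
    """Density created in addition to the overall average."""
    return lift(posterior, prior) - 1.0

prior = 4 / 10       # casco ownership in the whole set
posterior = 4 / 6    # casco ownership among car owners

car_owner_lift = lift(posterior, prior)    # about 1.67
car_owner_gain = gain(posterior, prior)    # about 0.67
upper_bound = max_lift(prior)              # 2.5
```

Selecting car owners raises the casco density to roughly 1.67 times the overall average, against a theoretical maximum of 2.5 for a 40% prior.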
Decision Tree - 1
All customers: # customers = 10, # casco owners = 4, ownership = 40%
• Car owners: # customers = 6, # casco owners = 4, ownership = 67%
– Female: # customers = 3, # casco owners = 3, ownership = 100%
– Male: # customers = 3, # casco owners = 1, ownership = 33%
• Not car owners: # customers = 4, # casco owners = 0, ownership = 0%
– Female: # customers = 2, # casco owners = 0, ownership = 0%
– Male: # customers = 2, # casco owners = 0, ownership = 0%

Car ownership   Sex      Count   Casco owners   Ownership   Lift   Gain
1               Female   3       3              100%        2.50   1.50
1               Male     3       1              33%         0.83   -0.17
0               Female   2       0              0%          0.00   -1.00
0               Male     2       0              0%          0.00   -1.00
Grand Total              10      4              40%         1.00   0.00
Decision Tree - 2
All customers: # customers = 10, # casco owners = 4, ownership = 40%
• Female: # customers = 5, # casco owners = 3, ownership = 60%
– Car owners: # customers = 3, # casco owners = 3, ownership = 100%
– Not car owners: # customers = 2, # casco owners = 0, ownership = 0%
• Male: # customers = 5, # casco owners = 1, ownership = 20%
– Car owners: # customers = 3, # casco owners = 1, ownership = 33%
– Not car owners: # customers = 2, # casco owners = 0, ownership = 0%

Sex      Car ownership   Count   Casco owners   Ownership   Lift   Gain
Female   1               3       3              100%        2.50   1.50
Female   0               2       0              0%          0.00   -1.00
Male     1               3       1              33%         0.83   -0.17
Male     0               2       0              0%          0.00   -1.00
Grand Total              10      4              40%         1.00   0.00
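
The leaf tables of both trees can be computed from the same ten records. In this sketch the tuple encoding of the customers and the helper name leaf_table are assumptions. It also shows that the order of the two splits changes only the intermediate nodes: both trees end in the same four leaves.

```python
from collections import defaultdict

# (car_owner, sex, has_casco) for the 10 customers in the tables above
customers = (
    [(1, "Female", 1)] * 3
    + [(1, "Male", 1)] * 1 + [(1, "Male", 0)] * 2
    + [(0, "Female", 0)] * 2
    + [(0, "Male", 0)] * 2
)

def leaf_table(rows, key):
    """Count, casco owners, ownership, lift and gain for every leaf."""
    prior = sum(r[2] for r in rows) / len(rows)
    leaves = defaultdict(list)
    for row in rows:
        leaves[key(row)].append(row[2])
    table = {}
    for leaf, flags in leaves.items():
        rate = sum(flags) / len(flags)
        table[leaf] = (len(flags), sum(flags), rate, rate / prior, rate / prior - 1)
    return table

tree_1 = leaf_table(customers, key=lambda r: (r[0], r[1]))  # car, then sex
tree_2 = leaf_table(customers, key=lambda r: (r[1], r[0]))  # sex, then car

for leaf, (count, owners, rate, lift, gain) in tree_1.items():
    print(leaf, count, owners, f"{rate:.0%}", f"{lift:.2f}", f"{gain:.2f}")
```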
Decision Tree - Branching
All customers: # customers = 10, # casco owners = 4, ownership = 40%
• İstanbul: # customers = 5, # casco owners = 2, ownership = 40%
• Other cities: # customers = 5, # casco owners = 2, ownership = 40%

İstanbul      Count   Casco owners   Ownership   Lift   Gain
1             5       2              40%         1.00   0.00
0             5       2              40%         1.00   0.00
Grand Total   10      4              40%         1.00   0.00

• If the number of elements in a leaf of the tree is too low, the measurement is not statistically reliable. Unreliable measurements lead to memorizing the data rather than learning from it, and the values we expect are not realized when the model is applied to new data.
• If the average ownership ratios of the leaves formed by a branching are close to each other, the branching produces no benefit. It even reduces the reliability of subsequent branches, because it divides the population into smaller pieces.
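
The two rules above can be turned into a simple check before a branching is accepted. This is a hedged sketch: the function name evaluate_split and both thresholds (a minimum of 30 elements per leaf, a minimum spread of 5 percentage points) are illustrative assumptions, not values from the slides.

```python
def evaluate_split(leaves, min_leaf_size=30, min_rate_spread=0.05):
    """Check one branching; leaves is a list of (count, positives) pairs.

    Returns (ok, reason). Both thresholds are illustrative assumptions.
    """
    if any(count < min_leaf_size for count, _ in leaves):
        return False, "a leaf is too small to measure reliably"
    rates = [positives / count for count, positives in leaves]
    if max(rates) - min(rates) < min_rate_spread:
        return False, "leaf rates are too close to each other"
    return True, "split separates the population"

# City split scaled up to hypothetical counts: both leaves at 40%
city_split = evaluate_split([(500, 200), (500, 200)])
# Car ownership split scaled up to hypothetical counts: 67% versus 0%
car_split = evaluate_split([(600, 400), (400, 0)])
# The original 10-customer example: leaves of 6 and 4 customers
tiny_split = evaluate_split([(6, 4), (4, 0)])
```

With these thresholds the city split is rejected for lack of separation, and the 10-customer example is rejected because its leaves are too small, however large the difference in rates looks.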
A/B Testing
A/B testing is an experiment in which multiple versions of a concept are applied to different, randomly defined subsets of the target population. The responses of the subsets to the different versions are used to measure the success of each version.

For A/B testing to be successful, the subsets exposed to the different versions must be of sufficient size. Otherwise, the test results will not be meaningful.
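
One common way to judge whether the subsets were large enough is a two-proportion z-test; the slides do not prescribe a particular test, so this choice is an assumption. The sketch below uses only the standard library, and the response counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two response rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical results: version A responds at 4.0%, version B at 5.0%
z_small, p_small = two_proportion_z_test(8, 200, 10, 200)
z_large, p_large = two_proportion_z_test(800, 20000, 1000, 20000)
```

The response rates are identical in both calls, but only the larger subsets give a small p-value; with 200 customers per version the same difference cannot be distinguished from chance.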

Base (champion) model vs. Alternative (challenger) model

1) Analyse the data, build the model.
2) Create alternatives, compare the performances.
3) Implement the winning model.
4) Repeat.
DON’T SETTLE !

Your work is going to fill a large part of your life, and the
only way to be truly satisfied is to do what you believe is great
work. And the only way to do great work is to love what you do.
If you haven’t found it yet, keep looking. Don’t settle. As with all
matters of the heart, you’ll know when you find it. And, like any
great relationship, it just gets better and better as the years roll
on. So keep looking until you find it. Don’t settle!

Steve Jobs
Commencement speech at Stanford University, June 2005

41
Thank you!
