GPU Implementations of Online Track Finding Algorithms at PANDA

Mitglied der Helmholtz-Gemeinschaft

HK 57.2, DPG-Frühjahrstagung 2014, Frankfurt
21 March 2014, Andreas Herten (Institut für Kernphysik, Forschungszentrum Jülich) for the PANDA Collaboration


PANDA — The Experiment

[Figure: the PANDA detector, 13 m long — labels: Magnet, STT (Straw Tube Tracker), MVD (Micro Vertex Detector)]


PANDA — Event Reconstruction
• Triggerless readout
  – Many benchmark channels
  – Background & signal similar
• Event rate: 2 × 10⁷/s
• Raw data rate: 200 GB/s → reduce by ~1/1000 using GPUs
  (reject background events, save interesting physics events)
• Disk storage space for offline analysis: 3 PB/y
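The data-flow numbers on this slide can be sanity-checked with simple arithmetic. The ~50% effective duty cycle below is an assumption on my part (not stated on the slide) chosen to reconcile the reduced rate with the quoted 3 PB/y:

```python
# Back-of-envelope check of the slide's data-flow figures.
raw_rate_gb_s = 200                  # raw data rate: 200 GB/s
reduction = 1 / 1000                 # online reduction factor ~1/1000
stored_rate_gb_s = raw_rate_gb_s * reduction   # 0.2 GB/s = 200 MB/s

seconds_per_year = 365 * 24 * 3600             # ~3.15e7 s
duty_cycle = 0.5                               # ASSUMED effective running time
stored_pb_per_year = stored_rate_gb_s * seconds_per_year * duty_cycle / 1e6

print(f"{stored_rate_gb_s * 1000:.0f} MB/s after reduction")
print(f"~{stored_pb_per_year:.1f} PB/year")    # consistent with the quoted 3 PB/y
```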

PANDA — Tracking, Online Tracking
• PANDA: no hardware-based trigger
• But computationally intensive software trigger → online tracking

[Figure: trigger schemes — usual HEP experiment: a hardware trigger selects events read out from the detector layers; PANDA: all detector data are processed online]


GPUs @ PANDA — Online Tracking
• Port tracking algorithms to GPU
  – Serial → parallel
  – C++ → CUDA
• Investigate suitability for online performance
• But also: find & invent tracking algorithms…
• Under investigation:
  – Hough Transform
  – Riemann Track Finder
  – Triplet Finder


Algorithm: Hough Transform
• Idea: transform (x,y)ᵢ → (α,r)ᵢⱼ, find lines via (α,r) space

    rᵢⱼ = cos αⱼ · xᵢ + sin αⱼ · yᵢ + ρᵢ

• Solve the line equation rᵢⱼ for
  – lots of hits (x,y,ρ)ᵢ and
  – many αⱼ ∈ [0°, 360°) each
  (i: ~100 hits/event (STT), j: every 0.2° → 180 000 values rᵢⱼ)
• Fill histogram
• Extract track parameters → bin gives track

[Figure: Hough transform principle — hits in (x,y) space become curves in (α,r) space; their common intersection marks a track]
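The accumulator fill described above can be sketched on the CPU as follows. This is an illustrative sketch, not PANDA code: function and parameter names are made up, ρ (the drift radius from the slide's (x,y,ρ) hits) is set to zero, and on the GPU each (hit, α) pair would be handled by one thread instead of nested loops:

```python
import math
from collections import defaultdict

def hough_fill(hits, alpha_step_deg=0.2, r_bin=0.01):
    """Fill an (alpha, r) accumulator from hits (x, y, rho)."""
    acc = defaultdict(int)
    n_alpha = int(360 / alpha_step_deg)
    for x, y, rho in hits:
        for j in range(n_alpha):
            a = math.radians(j * alpha_step_deg)
            # line equation from the slide: r = cos(a)*x + sin(a)*y + rho
            r = math.cos(a) * x + math.sin(a) * y + rho
            acc[(j, round(r / r_bin))] += 1
    return acc

# Hits lying on a common line accumulate in one (alpha, r) bin:
hits = [(x, 0.5 * x + 0.1, 0.0) for x in (0.1, 0.2, 0.3, 0.4)]
acc = hough_fill(hits)
peak_bin, votes = max(acc.items(), key=lambda kv: kv[1])
print(votes)  # → 4: the peak bin collects a vote from every hit
```

Extracting tracks then amounts to finding bins whose vote count exceeds a threshold, which is exactly the peak-finding step discussed on the next slides.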

Algorithm: Hough Transform

[Figure: Hough-transformed event — 68 (x,y) points fill an (α, r) accumulator histogram; α angle in ° on the horizontal axis, r on the vertical axis. PANDA STT+MVD, 1800 × 1800 grid]


Algorithm: Hough Transform — Two Implementations

Thrust
• Performance: 3 ms/event
  – Independent of α granularity
  – Reduced to a set of standard routines
• Fast (uses Thrust's optimized algorithms)
• Inflexible (has its limits, hard to customize)
  – No peak finding included
    • Even possible? Adds to the time!

Plain CUDA
• Performance: 0.5 ms/event
  – Built completely for this task
• Fits every problem
• Customizable
• A bit more complicated in parts
  – Simple peak finder implemented (threshold)
• Uses dynamic parallelism and shared memory

Algorithm: Riemann Track Finder
• Idea: don't fit lines (in 2D), fit planes (in 3D)!
• Create seeds
  – All possible three-hit combinations
• Grow seeds to tracks
  – Continuously test whether the next hit fits
  – Use mapping to Riemann paraboloid
• Summer student project (J. Timcheck)

[Figure: hits in the (x, y) plane mapped to the paraboloid coordinate z′; circles in 2D become planes in 3D]
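The paraboloid mapping behind the Riemann track finder can be sketched in a few lines: mapping each hit (x, y) to (x, y, x² + y²) turns circles in the plane into planes in 3D, so a linear plane fit recovers the circle. This is an illustrative sketch (names are mine, not PANDA code):

```python
import numpy as np

def riemann_circle_fit(xs, ys):
    """Fit a circle to 2D hits via the Riemann-paraboloid mapping.

    Hits on a circle (x-cx)^2 + (y-cy)^2 = r^2 satisfy
    x^2 + y^2 = 2*cx*x + 2*cy*y + (r^2 - cx^2 - cy^2),
    i.e. the mapped points (x, y, x^2 + y^2) lie on a plane.
    """
    z = xs**2 + ys**2
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    a, b, c = np.linalg.lstsq(A, z, rcond=None)[0]  # plane z = a*x + b*y + c
    cx, cy = a / 2, b / 2
    r = np.sqrt(c + cx**2 + cy**2)
    return cx, cy, r

# Hits on a circle of radius 2 around (1, -1):
t = np.linspace(0, 2 * np.pi, 12, endpoint=False)
cx, cy, r = riemann_circle_fit(1 + 2 * np.cos(t), -1 + 2 * np.sin(t))
print(round(cx, 3), round(cy, 3), round(r, 3))  # → 1.0 -1.0 2.0
```

Growing a seed then means checking whether a candidate hit's mapped point lies close enough to the seed's plane.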

Algorithm: Riemann Track Finder
• GPU optimization: unfolding loops

    for () {for () {for () {}}}  →  int ijk = threadIdx.x + blockIdx.x * blockDim.x;

  → 100 × faster than the CPU version

    nLayer(x) = (1/2)·(√(8x + 1) − 1)

    pos(nLayerₓ) = ∛(3√3·√(243x² − 1) + 27x) / 3^(2/3)
                   + 1 / (3^(1/3) · ∛(3√3·√(243x² − 1) + 27x)) − 1/2

• Time for one event (Tesla K20X): ~0.6 ms
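The loop unfolding above replaces the nested seed loops with a single flat index, so that one GPU thread handles one combination (ijk = threadIdx.x + blockIdx.x * blockDim.x in CUDA). A CPU sketch with made-up loop dimensions shows the index arithmetic:

```python
# Flatten a triple loop into a single index and unflatten it again,
# mirroring how one GPU thread is assigned per combination.
NI, NJ, NK = 4, 5, 6  # illustrative loop bounds

nested = [(i, j, k) for i in range(NI) for j in range(NJ) for k in range(NK)]

flat = []
for ijk in range(NI * NJ * NK):      # one "thread" per combination
    k = ijk % NK                     # unflatten the single index ...
    j = (ijk // NK) % NJ             # ... back into the three
    i = ijk // (NK * NJ)             # original loop counters
    flat.append((i, j, k))

print(flat == nested)  # → True
```

Because every combination gets an independent index, all of them can run concurrently instead of sequentially, which is where the quoted speed-up comes from.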

Algorithm: Triplet Finder
• Idea: use only a subset of the detector as seed
  – Combine 3 hits to a Triplet
  – Calculate circle from 3 Triplets (no fit)
• Features
  – Tailored for PANDA
  – Fast & robust algorithm, no t₀
• Ported to GPU together with the NVIDIA Application Lab

Triplet Finder — Time



Triplet Finder — Optimizations
• Bunching wrapper
  – Hits from one event have similar timestamps
  – Combine hits into sets (bunches) that fill up the GPU best
  – 𝒪(N²) → 𝒪(N)

[Figure: hit stream grouped into events and bunches]
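The bunching idea can be sketched as grouping the time-sorted hit stream wherever a large time gap occurs. This is only an illustration of the principle (function name and the simple gap criterion are mine; the real wrapper sizes bunches to fill the GPU best), but it shows why the pairwise combinatorics drop from the whole stream to within each bunch:

```python
def bunch(hits, gap):
    """Group time-sorted hit timestamps into bunches: a new bunch
    starts whenever the gap to the previous hit exceeds `gap`."""
    bunches = []
    for t in hits:
        if bunches and t - bunches[-1][-1] <= gap:
            bunches[-1].append(t)   # same bunch: hit is close in time
        else:
            bunches.append([t])     # large gap: start a new bunch
    return bunches

# Timestamps from three well-separated events:
hits = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0, 9.2]
print(bunch(hits, gap=1.0))  # → [[0.0, 0.1, 0.2], [5.0, 5.1], [9.0, 9.2]]
```

Hit-pair combinations are then formed only inside a bunch, so the total work scales with the sum of the (bounded) bunch sizes rather than with the square of the full stream length.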

Triplet Finder — Bunching Performance


Triplet Finder — Optimizations
• Compare kernel launch strategies (each running TF stages #1–#4):

  Dynamic Parallelism — 1 thread/bunch on the GPU, calling the stage kernels
  Joined Kernel — 1 block/bunch, all stages joined into one kernel
  Host Streams — 1 stream/bunch, the CPU combining streams

Triplet Finder — Kernel Launches

Preliminary (in publication)


Triplet Finder — Clock Speed / Chipset

Preliminary (in publication)

          Memory Clock   Core Clock / GPU Boost
  K40     3004 MHz       745 MHz / 875 MHz
  K20X    2600 MHz       732 MHz / 784 MHz

Summary
• Investigated different tracking algorithms
  – Best performance: 20 µs/event
  → Online tracking is a feasible technique for PANDA
• Multi-GPU system needed: 𝒪(100) GPUs
• Still much optimization necessary (efficiency)
• Collaboration with the NVIDIA Application Lab

Thank you!

Andreas Herten
a.herten@fz-juelich.de