
Data Science @ XING September 2020

DATA SCIENCE
Your service team with the Big Data heart

New Colleague!
Meet our new colleague in
central Data Science, César

Focus projects
• Armstrong Discovery
• Candidate Discovery
• Entity-Based Item Matching
• Search indexing architecture

For your nerds only
Solving Oozie dependency problems: creating pipelines with AWS

Like our newsletter?
Previous issues are here.

Operational Tracking
by David Cellnigg

Here at XING, almost every team tracks what users do via Adobe or their own backend tracking, or even both. In general, we can say that we are good when it comes to tracking. But there are some pretty unfortunate problems:

Nothing is standardised.

Every team created their own version of backend tracking, with their own version of what and how to track. This in itself causes multiple problems:

1. The schemas of the data that is tracked are very different and sometimes very hard to compare.
2. There's also no alignment on the tech that's being used.
3. It's hard to reuse what others built, because it's not really meant to be used outside of its specific product.

This leads to inconsistent quality and coverage of data.

Data is spread out.

In order to get an outside-of-your-product view of the user,
you have to find the data first. So you either know somebody who knows, or you are in for some hours of searching. The further out you want to go and the more holistically you want to see the user, the more effort it takes to connect the tracking data from the different teams. I'd even go as far as to say that we never really did that in the past.

So in general, it's not an easy task to leverage the data that we have. And that's a pity, because we (a) already have a lot of data and (b) the data that we have is definitely interesting and helpful.

The good news is: we worked on a solution that will hopefully solve these problems and/or make it
easier to at least leverage the data. This solution has the very inspiring name of "Operational data
tracking" (aka Armstrong tracking, but that name should not be used anymore …).

What it will solve:


• Consistent and high-quality data: We provide a standardised schema that caters to most, if not all, of the needs across the company.
• State-of-the-art implementation and centrally maintained infrastructure
• Easy access/use: The data is stored in one place, so you don’t have to search for the data

We also built components in different programming languages that allow for easy implementation and usage. These components are already part of the mono-repos and ready to use. More information on their usage can be found on https://operational-tracking.xing.io/.
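To make the idea a bit more tangible, here is a purely hypothetical sketch of what a standardised event could look like in such a component; the real schema and API live in the mono-repos and are documented on the page above, and all field names below are made up:

// Hypothetical sketch only: the actual component API and schema are documented
// at https://operational-tracking.xing.io/. Field names here are illustrative.
import java.time.Instant
import java.util.UUID

final case class OperationalEvent(
  eventId: UUID,               // unique ID per event
  userId: Long,                // the acting user
  eventType: String,           // e.g. "click", "view", "add"
  itemUrn: String,             // the entity the event refers to
  sentAt: Instant,             // client-side timestamp
  context: Map[String, String] // product-specific extras
)

object TrackingExample {
  def main(args: Array[String]): Unit = {
    val event = OperationalEvent(
      eventId   = UUID.randomUUID(),
      userId    = 12345L,
      eventType = "click",
      itemUrn   = "urn:x-xing:jobs:posting:987",
      sentAt    = Instant.now(),
      context   = Map("module" -> "disco")
    )
    // In the real components this would be serialised and published to the
    // central store; here we just print it.
    println(event)
  }
}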

In general, this allows us to have a truly holistic view of the user and to leverage this data within our Data Science products (and beyond). And that’s something that we should all long for. :)

Welcoming Our New DS Colleague: César

Hi, I’m César. I’m from a small town in Huelva called Lepe, although I currently live in Cadiz.
I’m a climber and love to be outdoors, in the mountains and on the beach.
I also love motorcycles. I have been riding them since I was a kid.
I’m vegan as well.
I’m a technically/mechanically inclined person, so I like to fix stuff and learn
how it works, and I think that also applies to my work as an engineer.

• I’ve been using Scala since 2013-2014
• I really like the functional programming style
• I’ve worked with Spark on premise and in the cloud in the banking industry, as well as in the postal industry.
• I also have some experience with microservices.


Focus project updates

Armstrong Discovery
DS work on new Disco services is now wrapping up as the deadline for v1 approaches! It's becoming more important to test the Disco production services, ensure that all services work together, and perform sanity checks on production data.

We built a topic metadata endpoint which, given a topic ID, returns the information about this topic, so far including a label (English or German) and a description. In the future, it will be possible to add a logo or image for each topic.
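For illustration, the response of such an endpoint could be modelled roughly like this (a sketch only; the actual endpoint path, field names and types may differ):

// Hypothetical shape of the topic metadata response; the real endpoint and
// field names may differ from this sketch.
final case class TopicMetadata(
  topicId: String,         // the requested topic ID
  label: String,           // localised label, English or German
  description: String,     // short description of the topic
  imageUrl: Option[String] // not available yet, planned for the future
)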

We are also implementing Operational Tracking in disco-modules to ensure that everything is tracked on the backend and that we are prepared to send and receive tracking from the frontend. This also involved ensuring compatibility and continuity with the current ds-tracking for recommenders.

Topics: we created a relevancy threshold to suggest content on relevant topics and also optimized the
item-topic assignments and topic recommendations. These services are now being used in many places
in the Disco world, including the Explore tab (for topics the user is not following) and for "My
Network Updates" (followed topics).
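As a tiny illustration of the thresholding idea (the case class, score scale and cut-off value below are made up, not our actual pipeline):

// Illustrative sketch only: keep item-topic assignments whose relevancy score
// clears a threshold, best first.
object RelevancyFilter {
  final case class TopicAssignment(itemId: Long, topicId: Long, score: Double)

  def relevantTopics(assignments: Seq[TopicAssignment], threshold: Double = 0.6): Seq[TopicAssignment] =
    assignments.filter(_.score >= threshold).sortBy(-_.score)
}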

An A/B/C/D test for the news recommender started on 24 Sept., thanks to the News Data team! It will test features based on association rules, topic-based recos, and a combination of those features. Topic-based recos deliver news articles based on our topic recommender and topic classification services.

An analysis to predict the "half-life decay" of article performance was also done. It turns out that articles like "5 ways to know your boss sucks" or "The best ways to answer the 'weaknesses' question in an interview" are perennial favorites, turning up year after year with good performance. On the other hand, they are rather click-baity and usually not very substantive. The News Editorial team has advised us that these articles do add entertainment value, which, in the right mix, offers a light alternative to heavy informative articles.

Towards the shorter end of the time-scale, the analysis revealed that articles coming from weekly newspapers or monthly magazines were indeed relevant and clicked within the intended timeframe, while articles from daily sources were correspondingly short-lived, their performance peaking and dropping off within the day. The News Editorial team is providing us with a table of news sources categorized as daily, weekly and monthly. This will be incorporated into the news recommender to recommend articles within the appropriate timeframe. This analysis can also be extended within Armstrong Disco to provide a differentiated view on what "fresh" content is.
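To make the "half-life decay" idea concrete, here is a minimal sketch assuming a simple exponential-decay model (the actual analysis may use a different model and features):

// Illustrative only: if an article's click performance decays exponentially,
// its half-life is the time after which it drops to 50% of the initial level.
object HalfLife {
  // performance after t days, given initial performance p0 and half-life tHalf (days)
  def performance(p0: Double, tHalf: Double, t: Double): Double =
    p0 * math.pow(0.5, t / tHalf)

  def main(args: Array[String]): Unit = {
    // A "daily" article with a half-life of 0.5 days retains ~6% of its
    // performance after 2 days; an evergreen piece with a 90-day half-life
    // still retains ~98%.
    println(performance(p0 = 100.0, tHalf = 0.5, t = 2.0))  // ≈ 6.25
    println(performance(p0 = 100.0, tHalf = 90.0, t = 2.0)) // ≈ 98.5
  }
}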

Work started on "entity" recommenders, including the page recommender in collaboration with the News Data team and the DS company recommender. We are not only fitting these with topic filters, but also ensuring that these recommenders are up to date to provide fresh recos to Disco modules.


Candidate Discovery
New recommendation layout: The new layout for the candidate recommendations in the XING Talent Manager (XTM) is finally rolled out to 100% of the users. For each recommendation, a quality score as well as highlights like e.g. "top city", "following your company" or "very experienced" are provided (described in more detail in the last ds-newsletter). While the recommendations are still shown in a separate tab in the XTM, the recruiters are now automatically forwarded to this recommendations tab when they visit a project that does not hold any candidates yet, e.g. directly after the project creation.

Since the release, we see an item-based add-through-rate (ATR) of about 10%. This means that the recruiters add on average 10% of all candidate recommendations they see to an XTM project. For comparison, last year the item-based ATR was at 2%, even though the recruiters were not automatically forwarded and only interested recruiters visited the recommendations tab. Additionally, the user success (i.e. the percentage of recruiters per week who did not only see but added at least one recommendation to a project) increased from about 20% to 26%, and the absolute number of recruiters per week who add at least one recommendation to a project quintupled (from about 130 to about 685). Given that we have about 4,000 recruiters per week who add at least one candidate to a project (using any source, of which the XTM search is the most common one), this means we now support 17% of these recruiters in their work by providing candidate recommendations. We aim to increase this number even further, e.g. by extending the highlights and by making more recruiters aware of the improved recommendations tab.

Re-ranking the candidate recommendations for XTM projects: With the new recommendation layout being rolled out, we could extend the analysis for the interleaving test we started for the new re-ranker. The re-ranker aims to bring the up to 1,000 candidate recommendations that are created for a project in real time into the most suitable order for the recruiter. We decided to use an interleaving test instead of an A/B test as it speeds up the evaluation in the given scenario for the rather small user group in the XTM. This means that instead of dividing the recruiters into a test and a control group, each recruiter sees a list holding items provided by both the old and the new approach. Please see the last ds-newsletter for a more in-depth description of the re-ranker (in the Candidate Discovery update section) as well as for a detailed description of interleaving (cover story). While most KPIs (e.g. item-based add-through-rate or message-rate) are quite similar for both approaches, we see that the answer-rate of candidates recommended by the new approach is about 64% higher (15.2% for the old approach vs. 25% for the new approach). This is most probably due to the fact that the re-ranker takes the interest profiles of the users into account, which are created based on the users' interactions with job postings.
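For readers who missed the cover story in the last newsletter, the simplest flavour of interleaving (team-draft) can be sketched roughly like this; our actual implementation and the attribution of clicks/adds are not shown:

import scala.util.Random

// Minimal team-draft interleaving sketch (the generic technique, not our actual
// code): per round a coin flip decides which ranker picks first, and each
// ranker contributes its best not-yet-shown item.
object Interleaving {
  def teamDraft[A](oldRanked: Seq[A], newRanked: Seq[A], k: Int, rng: Random = new Random): Seq[A] = {
    val shown = scala.collection.mutable.LinkedHashSet.empty[A]
    def hasLeft(r: Seq[A]): Boolean = r.exists(a => !shown(a))

    while (shown.size < k && (hasLeft(oldRanked) || hasLeft(newRanked))) {
      val pickOrder = if (rng.nextBoolean()) Seq(oldRanked, newRanked) else Seq(newRanked, oldRanked)
      pickOrder.foreach(r => r.find(a => !shown(a)).foreach(shown += _))
    }
    shown.take(k).toSeq
  }

  // Usage sketch: show an interleaved top list to the recruiter; adds and
  // messages are then credited to whichever ranker contributed the item.
  def main(args: Array[String]): Unit =
    println(teamDraft(oldRanked = Seq("a", "b", "c"), newRanked = Seq("x", "b", "y"), k = 4))
}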

User Data Model: With the new user data model it is not just easier and, thus, faster to integrate new information into the recommender (e.g. to filter the recommendations), but the response time of the service could also be decreased. By using the user data model and bringing the data loading into the most suitable order, we could halve our p99 response time while still providing (and re-ranking) 1,000 recommendations.

Outlook: With the Zukunft Personal taking place in mid-October, we are currently finalizing and polishing our improvements for the candidate recommendations. The final steps include adding some additional filters the recruiters requested, using newly available information (i.e. the "ideal" candidate provided by clients using XING's new active sourcing service), and providing a new highlight ("the candidate visited similar job postings from your company"). Finally, the search query suggestions that will help recruiters start a search based on as many project details as possible, while still ensuring a certain amount of search results, are about to be released.

Entity-Based Item Matching


Job recommender - we A/B-tested an increase in the number of results of the profile-based sub-recommender. The test concluded positively, leaving the main KPIs mostly unaffected, while the same KPIs considering only results from the sub-recommender showed a positive effect. The number of users with no recommendation was also reduced significantly. This will allow future improvements to the profile-based sub-recommender to have a bigger impact on the job recommender's end results.

We are continuing our work on using entity relationships and entity importance in the profile-based
sub-recommender of the job recommender. More about this next time.

Search Indexing Architecture


This month we have shifted the focus to the migration of the profile-related indices to the new indexing architecture. This process is quite challenging since it requires data from many different sources, which means involving several teams. Ideally, we want all this data in Kafka, so we have been evaluating frameworks that perform operations on topics.
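As a rough illustration of the kind of topic operations such a framework would wrap, here is a minimal sketch using the plain Kafka AdminClient (broker address, topic name and settings are placeholders, not our setup):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

// Illustrative sketch: create a topic for profile updates via the Kafka
// AdminClient. Broker address, topic name, partition and replication settings
// are made up for the example.
object TopicOps {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    val admin = AdminClient.create(props)
    try {
      val topic = new NewTopic("profiles-updates", 12, 3.toShort)
      admin.createTopics(Collections.singleton(topic)).all().get()
    } finally admin.close()
  }
}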

On the other hand, we are working on a new initiative to build a data catalog. A data catalog is a central place where the data assets of the company will be classified, searchable and represented with their lineage. Many teams have collaborated in defining the requirements a data catalog should fulfil, and we have been evaluating open-source and enterprise solutions. We will be giving more details on this initiative in the following newsletters :)

New ontology tooling for stakeholders


by Inga Zager

Apart from the new curation tool we've introduced in the last
newsletter, we've also been working on stakeholder tooling for
the ontology over the last few months.

To make it as easy as possible for you to give us feedback on the ontology data, we've created a whole new stakeholder section for you that you can find here.

You'll land on a status overview page with two tabs, where you can follow up on the progress of your requests.

To actually put in requests, you can find two buttons in the right-hand corner where you can choose between two options:

1. Request a new label: You've discovered that we are missing one or several labels in the ontology? This is the way to go! Here you can let us know what we are missing and which concept it might belong to. Our curators will check your request and add the label/entity if the quality is good.

2. Request changes for existing entities: You came across an error in the existing ontology data,
e.g. typos, wrong translations, labels in the wrong entity? We're happy to have a look and correct
the mistakes.

All you have to do now is check back once in a while and wait for feedback from our curators in the status overview!

We are looking forward to lots of valuable feedback from you! Every hint that enables us to improve the ontology data is highly appreciated. Please also let us know if you face any problems using the tooling or notice any missing functionality.

Many thanks to Andrew, Annika, Dema, Mirko, Saif and Zaher for creating this tooling!

For Your Nerds Only: Oozie Flow


by Zaher Mousa

Motivation:

We often have dependencies between workflows (Oozie, Spark ...), where these workflows should run after each other as a pipeline. With the current Oozie implementation it's only possible to create a dataset dependency, but this is really not efficient when we have around 5 → 10 workflows that should run in sequence (similar to the Ontology release process), and it makes it hard to recover from failures related to cluster health.

Also, some workflows should not run simultaneously, to prevent an inconsistent data state or because they even remove some intermediate data which is shared.

The idea is to turn this problem into a generic solution: an event-based solution built on topics that hold the commands and states of the workflows, with a serverless approach for the main logic.

Tech Stack:


For Hackweek, the idea was to use the capabilities of AWS for a fast, ready-to-use infrastructure solution; below are some of the components used:

• Terraform: an infrastructure-as-code tool used for automatic provisioning

• MSK: Managed Streaming for Apache Kafka, released in mid-2019

• ECS: Elastic Container Service, a fully managed container orchestration service

• KafkaRestProxy: provides a RESTful interface to Apache Kafka clusters to consume and produce messages and to perform administrative operations

• KSQL: an event streaming database purpose-built to create stream processing applications on top of Apache Kafka, released at the end of 2019

• ApiGateway: a fully managed service that makes it easy to create, publish, maintain, monitor, and secure APIs

• Lambda: an event-driven, serverless computing platform

• EMR: Elastic MapReduce, an Amazon Web Services tool for big data processing and analysis

The first planned approach was to use a Lambda trigger, so the orchestration would be like [ApiGateway → KafkaRestProxy (ECS) → MSK → Lambda → KSQL (ECS) → Oozie]: messages about commands & states of the Oozie workflows would be received through Kafka and then trigger the Lambda, which is responsible for making the decision by using a KSQL streaming query to get the latest state of the previous workflow, checking whether any related workflow could block this one from running, and then performing the action via the Oozie API.


Unfortunately this didn't go as planned, since the integration between [MSK → Lambda] was only published a few weeks ago and is still not supported in the EU region, so I had to fall back to a different approach where I used a Spark Streaming application over the topics instead of the Lambda.

The final orchestration was similar to this: [ApiGateway → KafkaRestProxy (ECS) → MSK → SparkStreaming (EMR) → KSQL (ECS) → Oozie].
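A minimal sketch of what that Spark Streaming piece could look like (broker address, topic name and the decision step are illustrative placeholders, not the actual Hackweek code):

import org.apache.spark.sql.SparkSession

// Rough sketch of the fallback: a Spark Structured Streaming job on EMR that
// consumes the workflow command/state topic from MSK instead of a Lambda trigger.
object OozieFlowStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("oozie-flow").getOrCreate()

    val commands = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "b-1.msk.example:9092") // MSK brokers (placeholder)
      .option("subscribe", "workflow-commands")                  // hypothetical topic name
      .load()
      .selectExpr("CAST(key AS STRING) AS workflowId", "CAST(value AS STRING) AS command")

    // In the real job, each micro-batch would look up the latest workflow states
    // (e.g. via the KSQL query), decide whether the workflow may run, and call the
    // Oozie REST API; here we simply print the incoming commands.
    val query = commands.writeStream
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}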

Since I used a VPC with private subnets to hold most of the core components, I was not able to access the XING VPN to call the Oozie API, so I ended up testing the integration between these AWS components without a real test of the use case. Anyway, I realised that I was complicating the solution a bit (or a lot 🙂) and over-using some components, so I fell back to a simpler solution.

Actor-based solution

Based on the above conclusion I fell back to a simple REST/actor solution which could solve our two major issues with Oozie (run only if none of the conflicting workflows is currently running, and run multiple workflows in a sequential manner), respecting the two configuration models:

• Token Ring Group: represents the conflicting group where no two workflows should run at the
same time.

• Dependency Pipeline: represents the set of workflows that should be called after each other in a predefined order (to preserve state).

The actor looper will follow the specified flowchart whenever a new call for running a workflow comes up, in order to check the above issues and then apply the command, or wait and schedule another check after a configurable time interval.
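A rough sketch of that loop with Akka classic actors (the conflict lookup, the Oozie call and the check interval are illustrative assumptions, not the real configuration):

import akka.actor.{Actor, ActorSystem, Props}
import scala.concurrent.duration._

// Illustrative sketch of the actor loop: on a Run request the actor checks the
// token-ring (conflict) group; if a conflicting workflow is running, it
// re-schedules a check after a configurable interval.
object OozieFlowSketch {
  final case class Run(workflowId: String)

  class WorkflowLooper(conflictsWith: Set[String], runningNow: () => Set[String]) extends Actor {
    import context.dispatcher

    def receive: Receive = {
      case Run(id) =>
        if (runningNow().intersect(conflictsWith).isEmpty) {
          // no conflicting workflow is running: trigger it via the Oozie API (omitted)
          println(s"starting workflow $id")
        } else {
          // conflict detected: check again after a configurable interval
          context.system.scheduler.scheduleOnce(5.minutes, self, Run(id))
        }
    }
  }

  def main(args: Array[String]): Unit = {
    val system = ActorSystem("oozie-flow")
    val looper = system.actorOf(Props(new WorkflowLooper(Set("ontology-release"), () => Set.empty)))
    looper ! Run("entity-matching-pipeline")
  }
}

The same actor could also walk the dependency pipeline, sending the Run command for the next workflow once the previous one reports success.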
