
Expert session

PROD
This project has received funding from the European
Union's Horizon 2020 research and innovation
programme under grant agreement No 870822
Agenda
• Introduction
• A comparison of NACE and Microsoft Academic Graph (MAG) based
industry classifications
• Field of Study (FOS) code-based digitalization score
• Academy-Industry collaboration based on website data
• Q&A

Housekeeping issues
• The presentations will be recorded and published open access on the
project website.
• Please keep your camera and microphone turned off until the Q&A part.
• Your questions and comments in the Q&A part will not be a part of
the webinar recording.
• Feel free to put questions in the chat during the presentation.
• Please note that results presented are work in progress.
• Material is shared as a working paper.

BIGPROD web scraped data

Addressing the productivity paradox
• The objective of the project is:
• To extend existing econometric approaches on productivity with theoretically sound “Big
data” measures that can be operationalized and validated through pilots.
• To have deep stakeholder consultation, mitigating the skills gap, creating transparency,
enabling stakeholder influence in sources and tools, and keeping policy makers
informed on tools and pilots.

• Project website at https://cordis.europa.eu/project/id/870822


Project objectives
1. Management and coordination for utilizing “Big data” for innovation and productivity
assessment
2. Creating an extended econometric framework for the evaluation of the productivity-
innovation link based on “Big data”
3. Building a large-scale data platform and framework which will yield frequent batch data on
company performance and innovation activities
4. Creating policy-relevant pilots that measure the impact of proposed changes, while keeping
policymakers informed on the tools developed and piloted.
5. Utilizing the most effective tools available to effect stakeholder engagement and co-creation,
while simultaneously ensuring the dissemination of the knowledge gained in this process to
the wider public.

Data
Data processing

• 183,161 medium-high and high tech companies
• Web scraping: targeting various aspects of micro-level innovation activity such as collaborative activities, company’s products, and use of data standards
• 96,921 med-high and high-technology companies
• Constructing the DB: relational database as a PostgreSQL database including 7 data tables and 28 variables
• Financial data: enriching by linking to publicly available financial data
• 47,826 companies, EU-27 and UK

Sectors (NACE code):
o Manufacture of basic pharmaceutical products and pharmaceutical preparations (21)
o Manufacture of computer, electronic and optical products (26)
o Manufacture of air and spacecraft and related machinery (30.3)
o Manufacture of chemicals and chemical products (20)
o Manufacture of weapons and ammunition (25.4)
o Manufacture of electrical equipment (27)
o Manufacture of machinery and equipment n.e.c. (28)
o Manufacture of motor vehicles, trailers and semi-trailers (29)
o Manufacture of other transport equipment (30) excluding Building of ships and boats (30.1) and excluding Manufacture of air and spacecraft and related machinery (30.3)
o Manufacture of medical and dental instruments and supplies (32.5)
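The NACE scope listed on this slide can be expressed as a small filter. The following is a minimal sketch, not the project's actual selection code: the function name `is_target_nace` and the rule that a code needs at least three digits are assumptions made for illustration.

```python
# Divisions, groups and exclusions transcribed from the slide's NACE list.
TARGET_DIVISIONS = {"20", "21", "26", "27", "28", "29"}
TARGET_GROUPS = {"254", "303", "325"}   # 25.4, 30.3 and 32.5
EXCLUDED_GROUPS = {"301"}               # 30.1 Building of ships and boats


def is_target_nace(code: str) -> bool:
    """Return True if a NACE code such as '2825' or '28.25' is in scope."""
    digits = code.replace(".", "").strip()
    # Assumption: at least group level (3 digits) is needed to apply exclusions.
    if len(digits) < 3 or not digits.isdigit():
        return False
    division, group = digits[:2], digits[:3]
    if group in EXCLUDED_GROUPS:
        return False
    if group in TARGET_GROUPS:
        return True
    # Division 30 is in scope except 30.1; 30.3 is already handled above.
    return division == "30" or division in TARGET_DIVISIONS
```

For example, `is_target_nace("2825")` is true, while `is_target_nace("3011")` (ships and boats) and `is_target_nace("2511")` (outside group 25.4) are false.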

Database structure
What is on websites anyway?
• Websites offer complementary data on companies’ innovation
activities, compared to patenting and publication activities,
particularly in downstream innovation activities (Gök et al., 2015).
• Interesting work on capturing innovation-related data from websites
includes Kinne & Axenbeck (2020), Arora et al. (2020) and Li et al.
(2018).
• We did a content analysis of a sample of BIGPROD data companies.
• The content analysis of 38 companies’ websites, including large and SME
firms in both B2B and B2C, showed that the website information can be
categorized into 7 categories:
What we get from webpages

No. | Category | Description
1 | Competitive advantage | The website attempts to signal the competitive advantages of the firm’s products and services. The competitive advantages can be related to quality and technology level, multi-aspect orientation, affordability, and product/service standards.
2 | Competence and capabilities | The website communicates the firm’s competencies and capabilities. This message can also be highlighted using the firm’s knowledge/capabilities in offering diverse solutions, the firm’s leadership and dominance in the market, as well as its relations with other firms/brands.
3 | Corporate social responsibility | The website communicates corporate social responsibility in terms of how social responsibility concerns are engaged in the company’s business activities and policies. Corporate social responsibility may include sustainability issues, philanthropic activities, as well as inclusion and diversity.
4 | Ethics and compliance | The website may communicate the firm’s ethics and compliance, which can be explained through codes of conduct and ethical frameworks.
5 | Organizational structure | The website describes the organizational structure, investor relations, and corporate governance. This message may also cover the firm’s mission and vision, long-term strategies, and growth framework.
6 | Financials | The website presents the financial documents and earnings of the company, to show the profitability of the company.
7 | Supply chain | The website targets current and future suppliers and logistics partners to communicate the firm’s alliance strategies for growing the business.
A comparison of NACE and Microsoft Academic Graph (MAG) based industry classifications
European Classification of Economic Activities (NACE)
Limitations
• Distinguishing all activities and being inclusive.
• Changes in economic structures and organizations and technological
developments give rise to new activities and products, which may supersede
existing activities and products.
• The difference in the identification and grouping of similar economic activities
associated with moving to the new NACE implies a statistical break in the time
series.
• Therefore, the NACE classification is under a constant review process. In NACE Rev. 2, the
detail of the classification has substantially increased (from 514 to 615 classes).
Mitigating NACE limitations
• Since the introduction of NACE Rev. 2, attempts have been made to map and cross-validate the
industry-driven codes to research and technologies (Schmoch et al., 2003), enabling analysis of
relationships between industries based on their technology capabilities.
• The background for the reallocation of different indicators is that many indicators commonly used in
innovation research at the meso-level to measure the output of innovation systems are measured
at different scales (Neuhäusler, Frietsch and Kroll, 2019).
• In order to assess the effects of innovation indicators on various social, economic, environmental
and technical events, some concordance efforts have been made to re-allocate different
indicators to each other (Frietsch et al., 2017; Neuhäusler et al., 2017).
• In recent attempts, probabilistic concordance schemes have been generated for the assignment of
patents and scientific publications to NACE codes (Neuhäusler, Frietsch and Kroll, 2019).
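To make the idea of a probabilistic concordance concrete, the sketch below spreads counts of documents (for example patents) over NACE codes in proportion to concordance weights. It is an illustration only: the class labels `tech_A` and `tech_B`, the weights, and the function `allocate_to_nace` are invented here and are not taken from the cited concordance schemes.

```python
from collections import defaultdict

# Hypothetical concordance: technology class -> {NACE division: probability}.
# Labels and weights are invented for illustration; each row sums to 1.
CONCORDANCE = {
    "tech_A": {"26": 0.75, "27": 0.25},
    "tech_B": {"21": 1.0},
}


def allocate_to_nace(counts_by_class):
    """Distribute document counts per technology class across NACE codes."""
    totals = defaultdict(float)
    for tech_class, count in counts_by_class.items():
        # Classes missing from the concordance are skipped in this sketch.
        for nace, probability in CONCORDANCE.get(tech_class, {}).items():
            totals[nace] += count * probability
    return dict(totals)


allocated = allocate_to_nace({"tech_A": 8, "tech_B": 4})
```

With these made-up weights, eight documents in `tech_A` contribute 6 to NACE 26 and 2 to NACE 27, so the fractional counts still sum to the original total.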
Methodology
• New NACE classifications reallocation and concordance model.
• Web scraping exercise on companies' web pages, retrieving the textual content
• Map companies' activities indicated in their websites to a hierarchical topic modeling classification
• Populating companies' NACE code classification along the web scraped topic modeled new classifications.

Process steps:
• Company meta data from ORBIS
o Company ID
o Company Name
o NACE classification
o Website address
• Business and economic activity classified by NACE codes (4 digits)
• Website scraping
o Automated process with Python scripting language
o Text cleaning and harmonizing
• Mapping text content to MAG
o Classification by NLP and topic modelling based on text content
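The "text cleaning and harmonizing" step can be pictured with a short standard-library sketch. This is an assumption about what such a step might involve (dropping markup, scripts and styles, normalizing Unicode and whitespace, lowercasing); the slides do not describe the project's actual pipeline, and the names `_TextExtractor` and `clean_html` are hypothetical.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def clean_html(html: str) -> str:
    """Return lowercased page text with markup and extra whitespace removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # NFKC folds compatibility characters, e.g. non-breaking spaces.
    text = unicodedata.normalize("NFKC", " ".join(parser.chunks))
    return re.sub(r"\s+", " ", text).strip().lower()
```

The cleaned string would then be the input to whatever NLP or topic modelling step maps the text to MAG fields of study.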
Microsoft Academic Graph (MAG)
Microsoft Academic Graph
(MAG) is a large heterogeneous
graph comprised of more than
200 million publications and the
related authors, venues,
organizations, and fields of study.

This graph is used to power


experiences in Bing, Cortana,
Word, and in Microsoft
Academic.
NACE code allocation to FOS code
• 65 NACE codes map to 4,818 FOS ids, on average a 74-fold
expansion of a NACE code.
• The 65 NACE codes (blue) and 4,818 FOS codes
(red) are presented as nodes with 7,621 edges
connecting them. The size of a node indicates its
outdegree, which counts the outgoing edges to other
nodes.
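The degree counts behind such a NACE-to-FOS network can be sketched with plain counters. The edge list below is a made-up toy example (the real network has 7,621 edges), and the direction NACE to FOS is assumed from the slide's description of outdegree.

```python
from collections import Counter

# Hypothetical (NACE code, FOS tag) edges, directed from NACE to FOS.
edges = [
    ("2825", "filter paper"),
    ("2825", "pneumatic filter"),
    ("2829", "filter paper"),
    ("2011", "carbon filtering"),
]

out_degree = Counter(nace for nace, _ in edges)  # FOS ids linked per NACE code
in_degree = Counter(fos for _, fos in edges)     # NACE codes linked per FOS id

# Average expansion of a NACE code, using the totals stated on the slide.
expansion = 4818 / 65
```

Dividing 4,818 FOS ids by 65 NACE codes gives roughly 74, which is the expansion factor quoted on the slide.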
Network

Visibility on various filtering | Minimum 2 connections | Minimum 4 connections | Minimum 5 connections | Minimum 8 connections
Nodes | 686 (14.05%) | 336 (6.88%) | 167 (3.42%) | 67 (1.37%)
Edges | 2375 (31.16%) | 1326 (17.4%) | 655 (8.59%) | 91 (1.19%)
Revealing the codes' connections
• Filtering the network to a minimum indegree of 5
reveals over 100 FOS codes that are connected to
60 NACE codes, in other words expanding the
definition of the NACE codes.
NACE (4 digit code) | Description
2825 | Manufacture of non-domestic cooling and ventilation equipment
2829 | Manufacture of other general-purpose machinery n.e.c.
2732 | Manufacture of other electronic and electric wires and cables
2932 | Manufacture of other parts and accessories for motor vehicles
2042 | Manufacture of perfumes and toilet preparations
2011 | Manufacture of industrial gases
2899 | Manufacture of other special-purpose machinery n.e.c.
2020 | Manufacture of pesticides and other agrochemical products

Related FOS codes tag names:
• Filter paper • Butterworth filter • Vascular filter
• m-derived filter • UV filter • Filter design
• X-ray filter • Sinc filter • Capacitor-input filter
• Surface filter • All-pass filter • Separable filter
• Filter (function) • Scree filter • Prototype filter
• Vacuum ceramic filter • Electronic filter topology • Filter press
• Network synthesis filter • High-pass filter • Carbon filtering
• Pneumatic filter • Cloth filter
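Filtering by indegree, as on this slide, amounts to keeping only the FOS tags that several NACE codes point to. The sketch below groups a toy edge list by FOS tag; the pairs are invented for illustration and do not reproduce the actual links between the NACE codes and tags listed above.

```python
from collections import defaultdict


def shared_fos(edges, min_indegree):
    """Map each FOS tag with at least `min_indegree` NACE links to those codes."""
    nace_by_fos = defaultdict(set)
    for nace, fos in edges:
        nace_by_fos[fos].add(nace)
    return {
        fos: sorted(codes)
        for fos, codes in nace_by_fos.items()
        if len(codes) >= min_indegree
    }


# Hypothetical edges: which NACE code mentions which FOS tag.
toy_edges = [
    ("2825", "pneumatic filter"),
    ("2829", "pneumatic filter"),
    ("2932", "pneumatic filter"),
    ("2011", "carbon filtering"),
]
common = shared_fos(toy_edges, min_indegree=2)
```

Raising `min_indegree` shrinks the result, which mirrors how the node and edge counts drop as the minimum number of connections increases.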
Field of Study (FOS) code-based digitalization score

What is digitalization and digital capability?
Use of digital technologies to innovate business routines toward more
efficient and flexible performance, providing new revenue streams
through defining new business models, and promoting competitive
advantages by exploiting value-producing opportunities.

Digitalization capabilities, as a second order capability, can be
investigated under three different contexts of dynamic capability in
terms of Sensing phase, Seizing phase, and Reconfiguring phase
(Annarelli et al., 2021)
Digital capabilities from the dynamic capability theory perspective
[Diagram: Digitalization Capabilities divided into Sensing Phase, Seizing Phase and Reconfiguring Phase. Capabilities shown: Ecosystem capabilities; Supply chain integration capabilities; Improving the managerial cognitions; Deployment of IT; Organizing IT capabilities; Improvisational capabilities; Expanding the resources’ diversity; Timely resource reconfiguration; Monitoring the digital environment (Annarelli et al., 2021)]
Product digitalization
• Digitalization capabilities, such as the adoption of digital technologies, resources and
infrastructure, offer a different viewpoint than a firm’s orientation toward the
development of digital products.
• In manufacturing industries, competitive advantages are promoted based on technology and products.
The promotion of new technologies is particularly important in high-tech industries, and R&D inputs are
directed at new product development. (Hagedoorn and Cloodt, 2003; Björkdahl, 2020)
• Turning products into innovative digital products is a profound form of digital innovation.
Deploying digital components in products can increase the efficiency and functionality of a product, and also
simplify its editability/upgradability. (Björkdahl, 2020)
• However, developing digitally integrated products (digitized products) may require redesigning
traditional physical products, which can impose high development costs. Moreover, hidden burdens such
as restructuring the development process (for example testing the new product, which is crucial in manufacturing
industries) or the reusability of development platforms should be considered. (Björkdahl, 2020)
Measure development (data source)
• Measuring the capabilities using conventional data sources is hampered by data coverage (Arora et
al., 2020). Instead, this analysis measures digitalization using a novel methodology based on the companies’
webpages.
• Websites provide valuable information on company behavior (Gök, Waterworth and Shapira,
2015; Kinne and Axenbeck, 2020; Axenbeck and Breithaupt, 2021)
• Because firms communicate their capabilities on their websites, webpages can be used to
develop extensive internal capability measures at a large scale. Moreover, using webpages as a data
source provides more frequent and up-to-date data than conventional data sources (Arora
et al., 2020).
Operationalization
Digitalization scores investigate the presence of FOS ids associated with computer science.

\[
\text{Product digitalization} = \frac{n_{\text{digital}}}{n_{\text{non-digital}} + n_{\text{digital}}}
\]

$n_{\text{digital}}$ is the number of digital products of the firm, and $n_{\text{non-digital}}$ is the number of non-digital products.

\[
\text{Capability digitalization} = \frac{\sum_{i=0}^{n} x^{i}_{\text{digital}}}{\sum_{i=0}^{n} x^{i}_{\text{digital}} + \sum_{j=0}^{m} x^{j}_{\text{non-digital}}}
\]

$x^{i}_{\text{digital}}$ is the similarity score of FOS id $i$, known as a digital FOS id, when the website contains $n$ digital FOS ids.

$x^{j}_{\text{non-digital}}$ is the similarity score of FOS id $j$, known as a non-digital FOS id, when the website contains $m$ non-digital FOS ids.
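The two scores defined on this slide are simple ratios, and a minimal sketch of them follows. The function names and the choice to raise an error when a firm has no products or FOS ids at all are assumptions; the slides do not say how that case is handled.

```python
def product_digitalization(n_digital: int, n_non_digital: int) -> float:
    """Share of a firm's products that are digital."""
    total = n_digital + n_non_digital
    if total == 0:
        # Assumption: the score is undefined for a firm with no products.
        raise ValueError("no products to score")
    return n_digital / total


def capability_digitalization(digital_scores, non_digital_scores) -> float:
    """Share of total FOS similarity mass that belongs to digital FOS ids."""
    digital = sum(digital_scores)
    total = digital + sum(non_digital_scores)
    if total == 0:
        # Assumption: undefined when no FOS ids were matched on the website.
        raise ValueError("no FOS similarity scores to aggregate")
    return digital / total
```

For example, one digital product out of three gives `product_digitalization(1, 2)`, about 0.33, and a firm with only digital products scores 1.0.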
Case examples

• CIRCONTROL (https://circontrol.com):
o product digitalization score 0.33
o Selected products:
o eVolve Smart - charge station, urban environment, energy market, initial capital, strength design, one pole, capital investment, service station, fast charging station
o charging points - charge station, load management, fast charging, range anxiety, gas emissions, distribution networks, ultra fast, detect and avoid
o automated vehicle license recognition - license plate, parking space, different level, vehicle detection, user satisfaction, smart parking

• SPHERON-VR (www.spheron.com):
o product digitalization score 1.0
o Selected products:
o SCENECAM™ - led array, digital documentation, spherical image, vertical axis, crime scene, high-dynamic-range imaging, computer-generated imagery, high dynamic range, fill factor
o SCENECENTER™ - crime scene, high dynamic range, software technology, critical infrastructure protection, workflow
o SCENEWORKS™ - crime scene, high dynamic range, forensic science, documentation system

Picture source: https://circontrol.com/ and www.spheron.com


Academy-Industry collaboration based on website data

Number of connections
• Literature has extensively looked at the importance of collaboration, and in particular the collaboration between industrial actors and research organizations (Cohen 2002; Suominen 2018)
• Particularly ecosystems have seen a lot of research
• Methods to operationalize collaboration include, but are surely not limited to
o Joint patenting (Petruzzelli 2009)
o Research grants from industry to research organizations (D’Este et al. 2013)
o Survey data (e.g. Community Innovation Survey) (Kobarg et al 2013)
o Co-publishing between academia and industry (Abramo et al. 2009)

• What we are proposing here is an additional vantage point to measuring collaboration between industrial actors, and between industrial actors and research organizations.
• The measure is based on web scraping a sample of medium-high and high-technology companies from the EU and UK.
• The main motivation for the work is to offer additional measures to the partial views offered by existing measures.
• Making no claims of superiority, we see that the web scraped data offers insights into:
o Collaborative differences between different industries or by the size of companies
o Deep analysis on a geospatial level, analyzing regional ecosystems and the importance of distance at scale
o Analyzing collaboration by thematic factors
o Analyzing collaborative differences between industry-industry and industry-research organization
Number of connections
• The data contains 222,756 instances of
collaboration. From these:
o 57,899 are unique collaborators
o connected to 18,697 companies from the total sample.

• Overall,
o 19.4% of the companies have mentioned their
collaborative activities on their website,
o 9.5% of the companies had collaboration with a research and technology organization (RTO), and
o the average degree of the network is 5.785.
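The average degree of a collaboration network can be computed from a list of company-collaborator pairs. The sketch below is one common definition (twice the number of distinct undirected edges divided by the number of nodes); whether the slide's figure deduplicates repeated mentions in the same way is an assumption, and the organization names are placeholders.

```python
def average_degree(edges):
    """Average degree of an undirected network given (node, node) pairs."""
    # Assumption: repeated mentions of the same pair count as one edge,
    # and self-links are ignored.
    unique_edges = {frozenset(pair) for pair in edges if pair[0] != pair[1]}
    nodes = {node for edge in unique_edges for node in edge}
    if not nodes:
        return 0.0
    return 2 * len(unique_edges) / len(nodes)


# Placeholder data: companies A and B, collaborators X and Y.
toy_mentions = [("A", "X"), ("A", "Y"), ("B", "X"), ("A", "X")]
toy_average = average_degree(toy_mentions)
```

In the toy data there are three distinct edges among four organizations, so the average degree is 1.5.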


Data was complemented with geolocation
• Taking each organization's name and country
information, the Google API was used to retrieve the
latitude and longitude for each organization, whether
from the sample or a collaborator.
• The OpenStreetMap API was used as a secondary source.

• Of the collaborators, 52,029 organizations were
geocoded, which is 91.2 percent of all
organizations.

• Of the sample companies, 87,354 were
geocoded, which represents in total 90.9 percent of
the sample.
Differences between industry
• The collaboration sample does not have an equally
large sample of companies in different industries.
• We should note that the BIGPROD data does not contain an
equal distribution of companies across NACE classes.
• As seen in the table at the NACE two-digit level, there is an
over-representation of NACE 28, while NACE 25 and
30 have a significantly low number of companies.
• This said, NACE 21 has a significant number of
collaborations, with high mean and median values
in both collaboration types; in the table, only the much
smaller NACE 30 group has higher means.

NACE 2-level | count | All mean | All std | RTO mean | RTO std
20 | 6634 | 0.67 | 3.24 | 0.19 | 1.26
21 | 1758 | 1.76 | 6.99 | 0.68 | 3.13
25 | 137 | 0.83 | 2.55 | 0.18 | 0.75
26 | 7149 | 1.02 | 4.36 | 0.35 | 2.27
27 | 6374 | 0.67 | 3.07 | 0.18 | 1.19
28 | 17854 | 0.49 | 2.54 | 0.13 | 1.27
29 | 3281 | 0.84 | 3.22 | 0.21 | 1.10
30 | 411 | 2.12 | 5.97 | 0.73 | 2.76
32 | 2459 | 0.77 | 4.43 | 0.33 | 2.29
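A per-division summary like the one in the table can be derived from company-level connection counts. The sketch below is an assumption about the calculation, not the project's code: it groups by the first two digits of the NACE code and uses the sample standard deviation, and the input records are invented.

```python
from collections import defaultdict
from statistics import mean, stdev


def connections_by_division(records):
    """Summarize (nace_code, n_connections) records per NACE 2-digit division."""
    groups = defaultdict(list)
    for nace_code, n_connections in records:
        groups[nace_code[:2]].append(n_connections)
    return {
        division: {
            "count": len(values),
            "mean": mean(values),
            # Assumption: sample standard deviation; needs two or more values.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
        for division, values in groups.items()
    }


# Invented records: (4-digit NACE code, number of connections found).
toy_records = [("2110", 0), ("2120", 4), ("2825", 1)]
summary = connections_by_division(toy_records)
```

A large standard deviation relative to the mean, as in the table, indicates that most companies report few or no connections while a small number report many.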
Differences between industry
• Current literature focuses on the use of patent or
publication collaboration as an indication of
academia-industry linkages.

• Using the BIGPROD data, we evaluate if there is a
relationship between a company having publications or
patents and the company's collaborative behavior
(specifically with RTOs).

• Figures show the distribution of connections
overlaid with the company having scientific
publications or patents.
• A Mann-Whitney U test shows a statistically
significant difference in the number of connections to RTOs
between companies with and without publications
(U = 15311413.5, p < 0.05) and between companies with and
without patents (U = 24791264.5, p < 0.05).
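For readers unfamiliar with the test, the U statistic itself can be computed from rank sums. The sketch below implements only the statistic, with tied values given their average rank; it does not compute a p-value, for which a statistics library routine such as `scipy.stats.mannwhitneyu` would normally be used. The input values are made up.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`, giving ties their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        rank = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = rank
        pos = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples."""
    x, y = list(x), list(y)
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    ranks = average_ranks(x + y)
    rank_sum_x = sum(ranks[: len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Made-up RTO connection counts for two groups of companies.
u_without, u_with = mann_whitney_u([0, 0, 1], [0, 2, 5])
```

The two statistics always sum to the product of the sample sizes, so a value far from half that product suggests that one group tends to have more connections than the other.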
Differences between industry
• To reflect on the overlaps and differences in
coverage between the companies' co-publications and the
collaborations mentioned on their websites, we mapped
the collaboration network of co-publishing collaborators
for the 81 entities with publication activities.
• The approach generated four additional variables for
defining the overlap and difference between co-
publishing data and web scraped data.
• These four variables are constructed upon three major
groups: 1) collaborations in web scraped data, 2)
collaborations in co-publishing bibliometric data, 3)
“collaborations both in 1 and 2”. The results are seen in
the table.

Statistic | Web scraped magnitude compared to co-publication | Web scraped coverage compared to all | Web scraped coverage and co-publication overall | Web scraped additionality
AVG | 2.3620 | 0.3082 | 0.0418 | 0.2664
MIN | 0.0087 | 0.0087 | 0.0010 | 0.0077
MAX | 31 | 1 | 0.142857143 | 0.967741935
MED | 0.1546 | 0.1370 | 0.0323 | 0.1137
STDV | 7.0344 | 0.3052 | 0.0339 | 0.2934
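The slide does not give formulas for the four variables, so the sketch below shows one plausible reading built from the three groups it names (web scraped collaborators, co-publication collaborators, and their intersection). The definitions, the function name `overlap_variables` and the toy sets are all assumptions made for illustration.

```python
def overlap_variables(web_scraped: set, co_published: set) -> dict:
    """Compare two collaborator sets for one company (assumed definitions)."""
    both = web_scraped & co_published
    everything = web_scraped | co_published
    if not co_published or not everything:
        raise ValueError("co-publication collaborators are required")
    return {
        # Size of the web scraped set relative to the co-publication set.
        "magnitude": len(web_scraped) / len(co_published),
        # Share of all known collaborators that the website mentions.
        "coverage": len(web_scraped) / len(everything),
        # Share of all known collaborators found in both sources.
        "overlap": len(both) / len(everything),
        # Share of all known collaborators found only on the website.
        "additionality": len(web_scraped - co_published) / len(everything),
    }


# Toy collaborator sets with placeholder names.
toy = overlap_variables({"a", "b", "c"}, {"c", "d"})
```

Under these assumed definitions, additionality equals coverage minus overlap, which matches the relationship between the averages in the table (0.3082 - 0.0418 = 0.2664).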
Case study examples
Q&A

• Quantitative Science and Technology Studies team, Foresight-driven Business Strategies, VTT Technical Research Centre of Finland
• Competence Center Innovation and Knowledge Economy, Fraunhofer ISI
• Economics of Knowledge and Innovation team, Maastricht University
• Public Policy and Management Institute
• Economics of Technology and Innovations, Faculty of Technology, Policy and Management, Delft University of Technology
• School of Government & Public Policy, Faculty of Humanities & Social Science, University of Strathclyde
