Extracting Social Media Data from LinkedIn, Facebook, and Twitter

© 2011 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means
(electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation.
Abstract
PowerExchange for LinkedIn, PowerExchange for Facebook, and PowerExchange for Twitter provide native, high-
performance connectivity to social media data in popular social networks like LinkedIn, Facebook, and Twitter. This article
demonstrates how to use the adapters to search and extract the social media data.
Supported Versions
PowerExchange for LinkedIn 9.1.0
PowerExchange for Facebook 9.1.0
PowerExchange for Twitter 9.1.0
Table of Contents
Introduction
Before You Begin
The Social Media Demo File
Configuring the Social Media Objects
Topic Search Mapping
Topic Twitter Pipeline
Twitter User Mapping
Twitter User Details Workflow
Topic Search Workflow
Social Media Connections
Search Criteria Configuration
Sample Output
Introduction
PowerExchange for LinkedIn, PowerExchange for Facebook, and PowerExchange for Twitter can extract social media data
that match the search criteria that you specify.
You can define search criteria, search for topics, and extract social media data from all three social media networks. You
can then load the extracted data to a target and use it for text analytics and sentiment analysis.
You can download the social media demo file that includes the necessary sample mappings, workflows, and sessions. Using
the demo file, you can extract the following data:
- Public posts from Facebook that contain the topic.
- LinkedIn profiles that have the specified topic anywhere in the profile information. The profiles that are searched are
  the connections for the user account for which you provide an authentication token.
- Tweets that contain the topic and the corresponding user details.
Before You Begin
Before you download and use the social media demo file, complete the following tasks:
1. Install and configure the following adapters:
   - PowerExchange for LinkedIn
   - PowerExchange for Facebook
   - PowerExchange for Twitter
2. Verify that you have access to a database in which you can create the target tables. The social media demo file
uses an Oracle database for target definitions.
The Social Media Demo File
Download and import the social media demo file, TopicSearch_SocialMedia.XML, available at
https://communities.informatica.com/docs/DOC-5484. The download is a zip file that contains the exported objects in the
XML format. Import all the repository objects listed in the XML file into PowerCenter.
The social media demo file demonstrates how you can search for a specific topic and extract the social media data from
LinkedIn, Facebook, and Twitter. The Integration Service connects to the social media sites and extracts the topic data
based on the search criteria. The search criteria are defined through a query string and a workflow variable. The
Integration Service loads the social media data to a separate target for each site. For Twitter, the Integration Service
extracts the tweets first. For each tweet that is extracted, it extracts the corresponding Twitter user details.
The demo file contains the following objects:
Social Media Mappings
The m_Topic_Search mapping maps the social media sources to a target database. The mapping extracts the
following social media data:
- LinkedIn profiles that have the specified topic anywhere in the profile information.
- Facebook public posts that contain the topic.
- Tweets that contain the topic.
The m_read_Twitter_user mapping extracts the Twitter user profiles for each of the tweets.
Social Media Workflows
The wf_m_topic_search workflow contains the m_Topic_Search mapping that maps the social media sources to a
target database.
The wf_m_read_Twitter_user_details workflow contains the m_read_Twitter_user mapping to extract the Twitter
user profile for each Twitter user ID and store it in a target database.
Workflow Variable
The search topic is defined in a workflow variable for the wf_m_topic_search workflow. The variable is used in
the query string that the API of each social media site uses to search for the social media data.
Configuring the Social Media Objects
To run the workflows, you need to create target tables and configure the objects based on your environment.
1. Create the target tables. The demo uses an Oracle database for target definitions. You can modify the target
definition to use a different target database.
2. Import the source definition if you changed the database type.
3. Configure the Java transformation code. Modify the parameter file path and the options of the pmcmd
startworkflow command.
4. Create the connection objects. Specify the connection attributes for the social media adapters.
5. Choose the Integration Service for the sessions.
6. Modify the search criteria. Update the workflow variable for the topic you want to search.
Topic Search Mapping
The mapping m_Topic_Search contains multiple pipelines to extract LinkedIn profiles, Facebook posts, and tweets.
The mapping includes the following pipelines:
- Facebook Posts
- LinkedIn People
- Twitter Entry
- Topic Twitter
The following figure shows the m_Topic_Search mapping:
The mapping contains pass-through pipelines for Facebook, LinkedIn, and Twitter that extract data based on the search
criteria and load it to a relational target.
Topic Twitter Pipeline
The Topic Twitter pipeline contains transformations that extract and store the Twitter user profile for each of the tweets that
the Twitter Entry pipeline extracts.
The pipeline contains the following components:
Topic_Twitter(Oracle) source
The Topic Twitter source is the target of the Twitter Entry pipeline and contains the information about the user
tweets.
SQ_Topic_Twitter transformation
An Application Source Qualifier transformation that passes the Twitter user tweets data to the Remove_Duplicates
transformation.
Remove_duplicates transformation
An Aggregator transformation that defines a group for the Twitter user ID. When multiple tweets by the same user
match the search criteria, the Integration Service returns that user ID only once. The Integration Service
passes the Twitter user ID to the Java transformation.
Run_UserFetch_Workflow transformation
A Java transformation that includes the pmcmd command to start the wf_m_read_Twitter_user_details workflow.
The Java transformation creates a parameter file that defines a workflow variable, $$Twitter_User. The Twitter
user ID for each tweet is stored in the workflow variable. The wf_m_read_Twitter_user_details workflow extracts
the user details for the Twitter user ID. The Java code repeats to extract the Twitter user profiles for each user ID.
The location of the parameter file is defined in the string ParamterFilepath, which has the value
C:/Informatica/9.1.0HF1/server/infa_shared/TgtFiles/Twitter_User_Param/. Modify the parameter file path to any
location accessible by the PowerCenter Integration Service.
The Java transformation includes the following Java code:
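// Debug trace: the Twitter user ID received on the input port.
System.out.println("Inputfile read: " + Username);

// One parameter file per user, named after the user ID. Change this path for your environment.
String ParamterFilepath =
    "C:/Informatica/9.1.0HF1/server/infa_shared/TgtFiles/Twitter_User_Param/" + Username;
try
{
    // outstr is not declared in this snippet; it is assumed to be declared
    // elsewhere in the transformation as a DataOutputStream.
    outstr =
        new DataOutputStream(
            new BufferedOutputStream(
                new FileOutputStream( ParamterFilepath )));
    System.out.println("Outputfile created: " + ParamterFilepath );

    // Write the parameter file: a [GLOBAL] section that sets $$Twitter_User.
    outstr.writeBytes("[GLOBAL]");
    outstr.writeBytes("\n");
    outstr.writeBytes("$$Twitter_User=" + Username);
    outstr.close();

    // Update the -sv, -d, -u, -p, and -f values for your environment.
    String StartWorkflowCMD =
        "pmcmd startworkflow -sv DI_SM_HF -d Domain_inbgoofy_910HF1"
        + " -u Administrator -p Administrator -f Usecases"
        + " -paramfile " + ParamterFilepath
        + " -wait wf_m_read_Twitter_user_details";
    try
    {
        // Launch pmcmd as a separate process. The Java code does not wait for it to finish.
        Process CMD = Runtime.getRuntime().exec( StartWorkflowCMD );
        //CMD.waitFor();
        System.out.println("Inputfile read:2 " + Username);
    }
    catch ( Exception e1 )
    {
        e1.printStackTrace();
    }
    System.out.println("Inputfile read:3 " + Username);
}
catch ( FileNotFoundException nfx )
{
    System.out.println("Problem opening files" );
}
catch ( IOException iox )
{
    System.out.println("IO Problems" );
}
System.out.println("Inputfile read: " + Username);

// Pass the user ID through to the output port.
Username_out = Username;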
In the Java code, update the options of the pmcmd startworkflow command. The following table describes the
options:
Option    Description
-sv       Integration Service name.
-d        Domain name.
-u        User name.
-p        Password.
-f        Name of the folder containing the wf_m_read_Twitter_user_details workflow.
Target_Twitter_User_Parameter File target
You can delete the target file that the Integration Service creates. You do not need to review the data in this file.
The target is used for mapping validation.
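The grouping that the Remove_duplicates transformation performs can be pictured in plain Java. The following is only an illustrative sketch and not code from the demo file: the class name TweetUserDedup and the sample user IDs are hypothetical, and the choice to keep user IDs in order of first appearance is an assumption of the sketch, not a statement about how the Aggregator transformation orders its output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: collapses the user IDs of many tweets into one entry per user.
final class TweetUserDedup {

    /** Returns each user ID once, in order of first appearance. */
    static List<String> distinctUserIds(List<String> tweetUserIds) {
        // A LinkedHashSet drops repeated IDs and remembers insertion order.
        Set<String> seen = new LinkedHashSet<>(tweetUserIds);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Five tweets written by three distinct (made-up) users.
        List<String> ids = Arrays.asList("alice", "bob", "alice", "carol", "bob");
        System.out.println(distinctUserIds(ids)); // prints [alice, bob, carol]
    }
}
```

In the demo mapping, each distinct user ID would then trigger one run of the user details workflow instead of one run per tweet.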
Twitter User Mapping
The mapping m_read_Twitter_User contains the source definitions, target definitions, and transformations for the Twitter
user profiles. The mapping m_read_Twitter_User extracts the user profiles for the tweets based on the search criteria.
The following figure shows the Twitter user mapping:
The Twitter user mapping is a pass-through mapping containing a Twitter user source definition for the Twitter user profiles
and a Topic_Twitter target that stores the data extracted from the Twitter user source.
Twitter User Details Workflow
The workflow wf_m_read_Twitter_user_details contains the s_m_read_Twitter_User session and a Start task to run the
session. The m_Topic_Search mapping starts the wf_m_read_Twitter_user_details workflow using pmcmd in the Java
transformation.
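As a rough sketch of the pmcmd call described above, the command can also be assembled as an argument list and started with ProcessBuilder, which avoids building one long command string and makes it easy to wait for the exit code. This is not the code in the demo file. The class and method names are hypothetical, the parameter file path in main is a placeholder, and the only pmcmd options used are the ones that appear in this article (-sv, -d, -u, -p, -f, -paramfile, -wait).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for starting wf_m_read_Twitter_user_details through pmcmd.
final class PmcmdStartWorkflow {

    /** Builds the pmcmd startworkflow argument list from the options shown in this article. */
    static List<String> buildCommand(String service, String domain, String user,
                                     String password, String folder,
                                     String paramFile, String workflow) {
        List<String> cmd = new ArrayList<>(Arrays.asList("pmcmd", "startworkflow"));
        cmd.addAll(Arrays.asList("-sv", service));          // Integration Service name
        cmd.addAll(Arrays.asList("-d", domain));            // Domain name
        cmd.addAll(Arrays.asList("-u", user));              // User name
        cmd.addAll(Arrays.asList("-p", password));          // Password
        cmd.addAll(Arrays.asList("-f", folder));            // Folder that contains the workflow
        cmd.addAll(Arrays.asList("-paramfile", paramFile)); // Parameter file for this run
        cmd.add("-wait");
        cmd.add(workflow);
        return cmd;
    }

    /** Starts the command and blocks until it exits. Returns the process exit code. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Values copied from the article, except the placeholder parameter file path.
        List<String> cmd = buildCommand("DI_SM_HF", "Domain_inbgoofy_910HF1",
                "Administrator", "Administrator", "Usecases",
                "/tmp/Twitter_User_Param/example_user",
                "wf_m_read_Twitter_user_details");
        // Only print the command here; call run(cmd) on a machine where pmcmd is installed.
        System.out.println(String.join(" ", cmd));
    }
}
```

Unlike the demo code, which leaves the waitFor call commented out, run blocks until pmcmd exits. Whether blocking is desirable inside a Java transformation depends on how many user IDs the session processes.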
The wf_m_read_Twitter_user_details workflow extracts the Twitter user profile for the Twitter user ID that the workflow
variable $$Twitter_User defines. The workflow variable is defined in the parameter file that the Java transformation uses to
start the workflow.
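The parameter file itself is small: a [GLOBAL] section heading followed by one assignment for $$Twitter_User. The following sketch writes such a file with java.nio instead of the stream classes used in the demo code. The class name, the temporary directory, the user ID example_user, and the UTF-8 encoding are assumptions made for illustration. Like the demo code, the sketch names the file after the user ID.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that writes one parameter file per Twitter user ID.
final class TwitterUserParamFile {

    /** Writes directory/userId with a [GLOBAL] section that sets $$Twitter_User. */
    static Path write(Path directory, String userId) throws IOException {
        Files.createDirectories(directory);
        Path file = directory.resolve(userId);
        List<String> lines = Arrays.asList("[GLOBAL]", "$$Twitter_User=" + userId);
        // Files.write terminates each line with the platform line separator.
        return Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for the Twitter_User_Param folder.
        Path dir = Files.createTempDirectory("Twitter_User_Param");
        Path file = write(dir, "example_user");
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

Because each user ID gets its own file, concurrent runs of the workflow do not overwrite each other's parameters.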
The workflow is enabled to run multiple instances concurrently with the same workflow name.
Topic Search Workflow
The workflow wf_m_topic_search contains the s_m_read_topic_search session and a Start task to run the session. The
m_Topic_Search mapping contains transformations that start the wf_m_read_Twitter_user_details workflow using pmcmd.
The workflow uses a variable, $$topic, in the query string of the Application Source Qualifier to input the search criteria for
LinkedIn, Facebook, and Twitter. The workflow variable is defined in the Variables tab of the Workflow Properties for the
wf_m_topic_search workflow. The variable has the datatype nstring and a default value of "Big Data."
The Integration Service is configured to truncate the targets before it loads the data.
Social Media Connections
Configure the application connections before you run the social media sessions. Verify the connection to the target database.
Specify the connection attributes that the PowerCenter Integration Service uses to connect to the social media adapters as
described in the following table:
Social Media Connection    Connection Attributes    Description
LinkedIn                   User Token               The user token and user token secret that the
                           User Token Secret        Open Authentication (OAuth) script returned.
                           row limit
Facebook                   row limit                Maximum number of rows of data to return.
Twitter                    row limit                Maximum number of rows of data to return.
Search Criteria Configuration
When you configure a session for a social media source, you specify the query string that the API of the social media uses
to search for the social media data.
The query string is defined in the Application Source Qualifier for each social media source in the session
s_m_read_topic_search. Use the workflow variable, $$topic, to input the search topic.
The following table describes the syntax of the variable, $$topic, in the query string:
Application Source Qualifier     Syntax of Query String
LinkedIn (SQ_LinkedIn_People)    keywords=$$topic
Facebook (SQ_Facebook_Post)      $$topic
Twitter (SQ_Twitter_Entry)       $$topic
The default value of the variable is "Big Data" and generates the following results:
- LinkedIn profiles that contain the topic "Big Data" anywhere in the profile information. The profiles are connections
  of the account used to generate the authentication details.
- Facebook public posts that contain the topic "Big Data." Only publicly visible posts are extracted.
- Tweets that contain the topic "Big Data" and the corresponding Twitter user profiles.
To change the search topic, modify the value of the variable $$topic in the Variables tab of the workflow properties.
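To make the effect of the variable concrete, the following sketch substitutes a topic into the three query string templates from the table above. It only illustrates the textual substitution. The class name is hypothetical, and the sketch makes no claim about how the Integration Service or the adapters expand variables or encode spaces internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of how $$topic appears in each query string.
final class TopicQueryStrings {

    /** Returns the query string per Application Source Qualifier for the given topic. */
    static Map<String, String> expand(String topic) {
        // Templates copied from the query string syntax table.
        Map<String, String> templates = new LinkedHashMap<>();
        templates.put("SQ_LinkedIn_People", "keywords=$$topic");
        templates.put("SQ_Facebook_Post", "$$topic");
        templates.put("SQ_Twitter_Entry", "$$topic");

        Map<String, String> expanded = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : templates.entrySet()) {
            // String.replace treats "$$topic" as literal text, not as a regular expression.
            expanded.put(entry.getKey(), entry.getValue().replace("$$topic", topic));
        }
        return expanded;
    }

    public static void main(String[] args) {
        // "Big Data" is the default value of $$topic in the demo workflow.
        for (Map.Entry<String, String> entry : expand("Big Data").entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

With the default topic, the LinkedIn query string becomes keywords=Big Data, and the Facebook and Twitter query strings become Big Data.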
Sample Output
The social media demo file defines a default topic "Big Data" to search for the social media data in LinkedIn, Facebook, and
Twitter.
The following figure displays a sample search result for the LinkedIn people profiles:
The following figure displays a sample search result for the Facebook public posts:
The following figure displays a sample search result for the Twitter user profiles:
Author
Vandana Rao
Lead Technical Writer
Acknowledgements
The author would like to acknowledge Raghu Rajanna and Ram Subramanyam Gopalan for their help with this article.