
1st Week

Review
- R LAKSHMINARAYANAN
- ARUN
MOTTO OF THE INTERNSHIP
For a data scientist, or any data practitioner, knowing how to extract data of all kinds is a key skill. Under the guidance of Mr. Krishna Sir and Mr. Venkat Sir, the goal of this internship is to gain the experience of a fully-fledged data engineer: receiving data, pre-processing it, and working with it. I was also very excited about the introduction of AWS, since it is a field that is new to me and is growing very rapidly, so with this internship we get a taste of AWS as well!

In the next slides we explain how the first week was divided: the workflow, the tools, and the code as well!
1st Week Work Given:

Mission:
TASK 1
Extract the entities from newspaper articles using spaCy and store them in a CSV file using pandas. Both packages belong to Python. (A minimal local sketch of this task is shown after the mission.)

TASK 2
Split Task 1 into two pieces of work using two Lambda functions:
Lambda function 1: store the CSV file in an S3 bucket in AWS.
Lambda function 2: retrieve the stored data from the S3 bucket and store it in a database such as MongoDB or DynamoDB.
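
Below is a minimal local sketch of Task 1, assuming the en_core_web_sm model is installed; the sample sentence and the output file name are illustrative only.

import pandas as pd
import spacy

# Load the small English pipeline and run named entity recognition on a sample text
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple and Microsoft announced new AI tools for newsrooms this week.")

# Keep only organisation (ORG) entities and write them to a CSV file with pandas
rows = [{"entities": ent.text, "labels": ent.label_} for ent in doc.ents if ent.label_ == "ORG"]
pd.DataFrame(rows, columns=["entities", "labels"]).to_csv("entities.csv", index=False)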
Tools

spaCy
01 spaCy is an open-source natural language processing (NLP) library used for efficient text processing and linguistic analysis. It provides pre-trained models and a simple API for tasks such as part-of-speech tagging, named entity recognition, dependency parsing, and more. spaCy is known for its speed, accuracy, and ease of use.

Lambda (AWS)
02 Lambda is a serverless compute service provided by AWS. It allows you to run code without provisioning or managing servers, paying only for the compute time used. It enables event-driven architecture and provides scalability, flexibility, and cost efficiency for executing small, independent functions.

Boto3 (AWS)
03 Boto3 is the Amazon Web Services (AWS) SDK for Python. It provides a Python interface to interact with various AWS services, allowing developers to easily create, configure, and manage AWS resources programmatically. Boto3 simplifies the process of integrating AWS services into Python applications and automating AWS infrastructure.

IAM (AWS)
04 IAM (Identity and Access Management) is a service provided by AWS that enables you to manage user access and permissions to AWS resources. It allows you to create and manage users, groups, and roles, and assign fine-grained access-control policies to secure your AWS environment.
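
As a quick illustration of how Boto3 is used, here is a minimal sketch; it assumes AWS credentials are already configured, and the buckets listed depend on the account.

import boto3

# Create an S3 client; credentials come from the environment or an attached IAM role
s3 = boto3.client("s3")

# List the buckets the credentials can see
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])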
Code
#lambda fun1

import boto3
import pandas as pd
from datetime import date
from newsdataapi import NewsDataApiClient
import spacy as sp


def lambda_handler(event, context):
    # Load the small English spaCy pipeline used for named entity recognition
    nlp = sp.load("en_core_web_sm")

    # Fetch technology news from the NewsData API
    api = NewsDataApiClient(apikey="pub_24882b9d820571920ae97c08ce45b818944b3")
    response = api.news_api(q="tech", category='technology')

    # Collect the relevant fields from every returned article
    text = []
    for i in response['results']:
        title = i['title']
        description = i['description']
        content = i['content']
        keywords = i['keywords']
        text.append([title, description, content, keywords])

    # Concatenate the articles into one string so spaCy can process them in a single pass
    output = ""
    for i in text:
        output += f"Title: {i[0]}\nDescription: {i[1]}\nContent: {i[2]}\nKeyword: {i[3]}\n\n"

    # Keep only the organisation (ORG) entities that spaCy finds
    doc = nlp(output)
    rows = [{'entities': entity.text, 'labels': entity.label_}
            for entity in doc.ents if entity.label_ == 'ORG']
    df1 = pd.DataFrame(rows, columns=['entities', 'labels'])

    bucket_name = 'file_holder'
    file_name = f'{date.today()}_news.csv'

    # Serialise the entity table to CSV text for the upload
    csv_string = df1.to_csv(index=False)

    s3_client = boto3.client('s3')

    try:
        # Upload the CSV string to S3
        response = s3_client.put_object(
            Bucket=bucket_name,
            Key=file_name,
            Body=csv_string,
            ContentType='text/csv'
        )
        print(f'CSV file saved to S3: s3://{bucket_name}/{file_name}')
    except Exception as e:
        print(f'Error saving CSV file to S3: {str(e)}')
        raise e


# Local test invocation of the handler
lambda_handler(
    event={
        "eventName": "dailyNewsUpdate",
        "category": "tech"
    },
    context={"news"}
)
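
A quick way to confirm the upload is a small check like the sketch below; it assumes the same bucket and key names as in the handler and valid AWS credentials.

import boto3
from datetime import date

# Optional verification that the object now exists in S3
s3_client = boto3.client("s3")
head = s3_client.head_object(Bucket="file_holder", Key=f"{date.today()}_news.csv")
print(head["ContentLength"], "bytes stored in S3")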
Code
#lambda fun2
import boto3
import csv


def lambda_handler(event, context):
    # The S3 event tells us which bucket and object triggered the function
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_name = event['Records'][0]['s3']['object']['key']

    dynamodb = boto3.resource('dynamodb')
    table_name = 'raw_data_news_api'

    try:
        # Download the CSV file that Lambda function 1 wrote to S3
        s3_client = boto3.client('s3')
        response = s3_client.get_object(Bucket=bucket_name, Key=file_name)
        csv_data = response['Body'].read().decode('utf-8').splitlines()

        table = dynamodb.Table(table_name)

        # Write every CSV row into DynamoDB, using the header row as attribute names
        with table.batch_writer() as batch:
            csv_reader = csv.reader(csv_data)
            header = next(csv_reader)
            for row in csv_reader:
                item = {}
                for i in range(len(header)):
                    item[header[i]] = row[i]
                batch.put_item(Item=item)

        return {
            'statusCode': 200,
            'body': 'Data stored in DynamoDB successfully'
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'Error storing data in DynamoDB: {str(e)}'
        }


# Local test invocation with a sample S3 ObjectCreated event
lambda_handler(
    event={
        "Records": [
            {
                "eventVersion": "2.1",
                "eventSource": "aws:s3",
                "awsRegion": "us-east-1",
                "eventTime": "2023-06-30T12:00:00Z",
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {
                        "name": "your-s3-bucket-name"
                    },
                    "object": {
                        "key": "your-csv-file-name.csv"
                    }
                }
            }
        ]
    },
    context={}
)
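
In a real deployment, Lambda function 2 would be triggered by an S3 ObjectCreated notification instead of the hand-built test event above. A rough sketch of that wiring is shown below; the bucket name and function ARN are placeholders, and S3 must also be granted permission to invoke the function.

import boto3

s3 = boto3.client("s3")

# Ask S3 to invoke the Lambda function whenever a new object is created in the bucket
s3.put_bucket_notification_configuration(
    Bucket="your-s3-bucket-name",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:lambda-fun2",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)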

• We could not run this in AWS because the dependencies package is more than 50 MB, so it could not be deployed to Lambda.
Thank You
