
R Integration for Golden Batch

This section outlines the implementation of R integration for the Golden Batch feature in <redacted>.

Scheduler
A Quartz job has been configured to run at a configurable interval (currently every 12 hours) that
picks up the R calculation jobs eligible for the current cycle (presumably, a job becomes eligible once
the time elapsed since its last_executed_at exceeds its frequency_in_sec). It primarily has three steps -

1. Creating data for R


The data required for computing the Golden Batch resides within <redacted> (batch and batch machine
data). This data is exposed by <redacted> via flat files (CSV). The approach has been highlighted in
the “R Integration Approach” document. For reference, the associated tables and Golden Batch-specific
sample metadata are included here.

● rcalc_job_metadata - the master table that stores the metadata.


○ job_name - ‘GOLDEN_BATCH’
○ frequency_in_sec - 86400
○ last_executed_at - ‘2015-04-26 10:10:10’
○ output_file - ‘batch_score.csv’
○ script - ‘Golden_batch_logistic_regression.R’
○ working_directory - ‘/home/xyz/Golden_Batch/’

● rcalc_input_metadata - stores metadata about the CSV file(s) generated by <redacted>.
○ input_file - ‘batch_data.csv’
○ param_name - ‘inFile’

● rcalc_query_column_headers - describes the data that needs to be created; for each input file, it
stores the columns that need to be populated in the CSV file as per the requirement of the script.
○ column_name - ‘id’ or ‘batchno’ or ‘batch_duration’
○ column_sequence - 1 or 2 or 7
○ column_type - ‘NUMERIC’ or ‘STRING’ or ‘NUMERIC’
○ is_key - true or false or false

● rcalc_sql_queries - for each input file, it stores the query (or queries) that need to be run to
populate the data in the CSV file.
○ query_sequence - 1
○ sql_query -

The generated CSV file ends up looking something like this -

"id","batchno","fbp_machine_duration","blender_machine_duration","compression_machine_duration","packaging_machine_duration","batch_duration"
1,E64006,639.15,381.9,1106,965.53,0.81125
2,E64015,668.67,323.93,1300.13,997.33,0.763363636
3,E64021,609.85,327.97,1270.02,1124.75,0.727416667
4,E63190,759.8,305.92,838.65,1246.93,0.9215
5,E64019,679.58,501,712.88,895.6,0.839545455
6,E64023,1293.63,359.22,924.72,1025.45,0.711615385
7,AUI4012,813.43,384.02,1217.63,3883.3,0.695916667
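
On the R side, the analysis script reads this file through the parameter declared in
rcalc_input_metadata (param_name ‘inFile’). A minimal sketch, assuming base R and the column
layout above -

# Load the input generated by <redacted>; 'inFile' is set by the runner script
batch_data <- read.csv(inFile, stringsAsFactors = FALSE)
# Columns follow rcalc_query_column_headers - id, batchno, the four machine
# durations and batch_duration
str(batch_data)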

--------------------------------------------------------------------------------------------------------------------------------------

2. Executing R script
Once the input file has been created, it is placed in the folder as indicated in the “working_directory”
field of the metadata table. Borrowing from the approach document -
“Rscript is an alternative front end for use in #! scripts and other scripting applications. It is an
executable that is part of the installation…”
We execute the runner script “Rscript.R” in the working directory, which plays three roles (a minimal
sketch follows this list) -
● Establish the environment for Rscript by including a shebang line at the top:
#!/usr/bin/env Rscript
● Parse the command-line arguments that indicate the working directory, the input file name
and the output file name
● Source the actual script - ‘Golden_batch_logistic_regression.R’
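
A minimal sketch of what Rscript.R might look like - the argument order and the outFile variable
name are illustrative assumptions; only the inFile parameter name is confirmed by the metadata
above -

#!/usr/bin/env Rscript
# Rscript.R - runner sketch; actual argument handling may differ
args <- commandArgs(trailingOnly = TRUE)
setwd(args[1])       # working directory, e.g. /home/xyz/Golden_Batch/
inFile  <- args[2]   # input file name, e.g. batch_data.csv
outFile <- args[3]   # output file name, e.g. batch_score.csv (name assumed)
# Hand control to the actual calculation script
source("Golden_batch_logistic_regression.R")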

Once the script has been executed, the output file “batch_score.csv” is created in the working directory.

--------------------------------------------------------------------------------------------------------------------------------------

3. Parsing and persisting Batch performance data


The output file is in CSV format. Sample output file -
batch_score.csv
"batchNo","batch_performance_index"
"E64006",952.762028800373
"E64015",1077.16272228172
"E64021",1144.93838035462
"E63190",854.567063246252
"E64019",830.355429853947
"E64023",1265.50498010926
"AUI4012",2260.68825635614

We have assumed at this point that any R calculation job will give us output in a
<key>-<set-of-values> format. For example, here “batchNo” is the key for any given row while
“batch_performance_index” is the value for that key (here there is just one value per key, but the
implementation supports multiple values per key).
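
From the script’s side, this contract is just a data frame written out with write.csv; a sketch,
where keys and values are placeholders rather than the actual model output -

# keys and values are placeholders for the model's output
output <- data.frame(batchNo = keys, batch_performance_index = values)
write.csv(output, file = outFile, row.names = FALSE)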

The data is parsed and stored in the rcalculation_fact table. Because this table uses the InfiniDB
engine, we load it with cpimport, similar to the approach used for storing the KPI Fact. The table has
the following major columns -
● job_name (same as in rcalc_job_metadata) - ‘GOLDEN_BATCH’
● row_id (key for each row) - ‘E64006’
● col_id (the name of the value) - ‘batch_performance_index’
● value_double (actual value) - 952.762028800373
● timestamp (execution time of script)

Note: Showing the Golden Batch on the screen

The batch with the maximum batch performance index value is deemed to be the Golden Batch.
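
Expressed in R for illustration (reusing the placeholder output data frame from the sketch above),
the selection reduces to -

# Batch with the highest batch_performance_index
golden_batch <- output$batchNo[which.max(output$batch_performance_index)]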
