
ToDo :

- Tags
- EME command
- m_dump
- multifiles commands
- String functions
- Date Functions
- Lookup Functions

1)
Input file
Col1 col2
1 A
2 B
3 C
4 D

and desired output should be like below:


col1 1 2 3 4
col2 A B C D
==> Use Rollup with key {} and the aggregation functions concatenation(in.col1) and concatenation(in.col2).
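Outside Ab Initio, the same transpose can be sketched in awk for intuition (a rough illustration; it assumes whitespace-separated input with a header row, and the file name input.dat is made up):

# build one output line per input column: the header name followed by all of that column's values
awk 'NR == 1 { c1 = $1; c2 = $2; next }
     { v1 = v1 " " $1; v2 = v2 " " $2 }
     END { print c1 v1; print c2 v2 }' input.dat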

2) Which version of Ab Initio works with Hadoop using the push-down mechanism (the job gets executed
on Hadoop)? Describe how it works.
==> GDE 3.1.6.1 with Co>Operating System 3.1.6.6 (and later versions).

3) Which components do not work in pipeline parallelism?


==> Components that have a MAX-CORE parameter, such as in-memory Rollup, in-memory
Join, Scan and Sort, because they must read a whole group (or all of their input) before writing output.

4) Ab Initio: display records between 50 and 75.


==> Filter by Expression: use next_in_sequence() > 50 && next_in_sequence() < 75 (a similar
expression such as next_in_sequence() != 5 can be used to skip a single record).

Alternatively, use the Run Program component in the GDE and write the command below:
`sed -n '50,75p' file1 > file2`

5)
I have data like below.
source file:
EmpID sal
A 1000
B 2000
C 3000
D 4000

Lookup File
EmpID
A
B
Output File
EMPID Sal
C 3000
D 4000

I want the EmpID and sal of employees that are not present in the lookup file.


==> Use a Filter by Expression (FBE) component with one of the following expressions:
is_null(lookup("lookup_file_name", in.EmpID))
first_defined(lookup("lookup_file_name", in.EmpID).EmpID, "") != in.EmpID

6) In my sandbox I have 10 graphs, which I checked in to the EME. I then checked out a graph and made
modifications, but found the modifications were wrong. What do I have to do to get the original
graph back?
==>
Say the original version number of the graph is "100", and after you made the first set of
modifications and checked in, the graph got a version of, say, "102".
Now you checked out the latest version of the graph, i.e. version 102 and did another set of
modifications. After checking in (say new version number 105) you realise that the changes were
incorrect.

In such a case the correct version is 102 on which you have to make the second set of changes
again.

To achieve this, check out version 102 (select appropriate version number in check-out wizard),
check it in again without any modification and setting the "force overwrite" option on. This will
create a new version of the graph, say 108, and this version will be the same as version 102.

So now you have version 102 as the latest version with a new version number 108, you can lock
and make the correct modifications on it.

Another way is to branch out, but in your scenario it doesn't appear to be the right option.

I have used version numbers in the explanation, which can be replaced by "tag names".

==>
According to your problem statement, all 10 graphs that you checked in to the EME are stored
there as the latest versions with the latest tag, and this is also reflected in your current
sandbox. Since you checked out and modified the graph by locking it in your own sandbox, and then
found that the modifications were not correct, simply follow this approach: don't
check anything in to the EME. Just re-check out the graph from the EME (which is
unaffected by your modifications) over the modified graph in your sandbox. This
overwrites the unwanted modified copy with the unmodified graph, which solves
your problem.

7)
Suppose we have two input files A and B, both having different columns and data; for example, A
has columns 1 2 3 4 and B has W X Y Z. How do we get columns 1 and 2 from A and W, X, Y from B
in the output?
==> Specify columns 1 and 2 from A and W, X, Y from B in the Fuse component's output DML, and then
combine the two files using the Fuse component.
8) Hi, if I want to run a graph in Unix, what command do I need to use? Correct me if I'm wrong: run
==>
To run a graph in Unix, follow these simple steps.

1. Deploy the graph as a wrapper script (.ksh).


2. In Unix, run the wrapper script as described below.

script_path/xyz.ksh --> if no parameters are defined.

script_path/xyz.ksh -PARAMETER1 <passed value1> -PARAMETER2 <passed value2> --> in case the

input parameters are variables and not fixed.

air sandbox run parameter_set_path/xyz.pset --> in case a .pset is defined.

9)
I have input like 1,2,3-7,8,9. How do I generate output like the below?
1
2
3
4
5
6
7
8
9
==> First use the Normalize component with the length function set to string_length(in.value) and, in the
transform, write string_substring(in.value, index + 1, 1). You then get one record per character:
1
,
2
,
3
-
7
,
8
,
9
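The characters above still need the separators dropped and the 3-7 range expanded to 3 4 5 6 7. For comparison, here is a rough shell sketch (not Ab Initio) of the full expansion; the input value is the one from the question:

# split on commas, then expand any "a-b" token into the numbers a..b
echo "1,2,3-7,8,9" | tr ',' '\n' | while read -r tok; do
  case "$tok" in
    *-*) seq "${tok%-*}" "${tok#*-}" ;;
    *)   echo "$tok" ;;
  esac
done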

10)
For data parallelism, we can use partition components. For component parallelism, we can use the
Replicate component. Similarly, which component(s) can we use for pipeline parallelism?
==> No specific component is needed: pipeline parallelism happens automatically between connected components in the same phase, as long as the flow has no blocking components such as Sort (which must read all input before producing output).

11)
What is output_index? How does it work in a Reformat? Does the below function show output_index in
use: output:1:if(in.emp.sal
==> The output_index function is used in a Reformat that has multiple output ports, to direct which record
goes to which out port.
For example, for a Reformat with 3 out ports such a function could be:
if (value == 'A') 1 else if (value == 'B') 2 else 3

which basically means that if the field 'value' of a record evaluates to A in the transform
function, the record will come out of port 1 only, and not from ports 2 or 3.

12)
I have 100 records in the input file. If I run the graph the first time, the first 20 records should go to the output; in the next
run records 21 to 40 should go, and so on, until in the 5th run records 81 to 100 go to the output. Please
let me know how I can achieve this.
==> Use a lookup file having only one field, count, with an initial value of 0.
In a Filter by Expression use next_in_sequence() > lookup("lookup_name").count and
next_in_sequence() <= lookup("lookup_name").count + 20. After every run, take the count of processed records
through a Rollup and overwrite the lookup file.
So in the 2nd run the lookup file's count value would be 20 and the FBE limits would be 21 to
20 + 20 = 40.
Alternatively, instead of the lookup you can use a plain file, read its value through a graph parameter, and use it with
next_in_sequence().
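A minimal shell sketch of the parameter-file variant (the file names processed_count.txt, input.dat and current_batch.dat are made up):

# read how many records were processed so far, take the next 20, then advance the counter
COUNT_FILE=processed_count.txt
start=$(cat "$COUNT_FILE" 2>/dev/null || echo 0)
awk -v s="$start" 'NR > s && NR <= s + 20' input.dat > current_batch.dat
echo $((start + 20)) > "$COUNT_FILE"

Run it five times against a 100-record file and each run picks up the next 20 records.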

13)
I have a scenario where I get a date field from the input in MMDDYY format. If it is the same year as the
processing year I should pass ""; else, if the input date is in the year previous to the processing date
and is less than 0301, I should pass Y.
==> Use a date function to get the current date and apply the date_year function to get the processing
year out of it.
Do the same for the input date to get the input year.
If these two are equal, set "" to the output. To check that the input date is in the previous year and within the
required range, use the date_time_difference function to get the difference in days between the
input date and the current date, and assign the output accordingly.
==> Input file --> Reformat --> Output file
if (string_substring(in.indate, 5, 2) == string_substring(now(), 5, 2)) "" else "Y";

14)
I have customer details as below,
cust_card_no item amount
10001 pen 10
10001 copy 20
10001 pen 10
10001 copy 20
10001 copy 20

Now my question is: find the distinct count of items per customer and the total amount spent by a
customer.

O/p : 10001 2 80
==> You can do it in two ways. If you are comfortable with Rollups expanded view
transformation we can achieve the output using single Rollup component as below.

Use the Rollup with key as customer_id and, in the transformation, collect all the items purchased
by a customer into a temporary vector (e.g. using vector_append). In finalize, sort this vector,
dedup it, and take its length, assigning that to the output field count. This is the count of
distinct items purchased by the customer. For the total amount spent by the customer you can use the
aggregation function sum().

Way 2 -> Use one Rollup on the raw data with key customer_id; this gives you the sum of amount.
In another flow, sort the data on {customer_id; item} and dedup on the same key. Apply a Rollup to this
flow with key customer_id; the count you get here is the count of distinct items the customer has
purchased. Finally, join these two flows on customer_id and map the fields to the output.
==> Input file --> Rollup (key {cust_card_no, item}) --> Rollup (key {cust_card_no}) --> Output file
In the first Rollup (one row per customer/item) write:
out.cust_card_no :: in.cust_card_no;
out.item :: in.item;
out.amt :: sum(in.amount);
In the second Rollup (one row per customer) write:
out.cust_card_no :: in.cust_card_no;
out.count :: count(in.item);
out.sum_amt :: sum(in.amt);

15)
I have a file containing 5 unique rows and I am passing them through a Sort component using a null
key, then passing the output of the Sort to Dedup Sorted. What will happen; what will be the output?
==> Sort won't do any sorting as the key is blank; it just reads the 5 records and writes them to output
as they are. When this data is fed to Dedup with a blank key, it outputs only one record, because it
treats the 5 records as a single group due to the blank key and gives the first record to the output.

16)
How can I sort the data using the Reformat component in Ab Initio?
==> You can first create a vector using the Rollup component, then use the vector_sort
function in the Reformat, and then expand the vector using Normalize. I hope this helps.

17)
How to improve the performance of graphs in Ab Initio? Give some examples or tips. Thanks.
==>
1. Use an MFS system, partitioning with Partition by Round-robin.
2. If needed, use Lookup Local rather than Lookup when the data is large.
3. Take out unnecessary components like Filter by Expression; instead provide the condition in the
Reformat/Join/Rollup.
4. Use Gather instead of Concatenate.
5. Tune max-core for optimal performance.
6. Try to avoid too many phases.
==>
To improve the performance of the graph:

1. Go parallel as soon as possible using Ab Initio partitioning techniques.


2. Once data is partitioned, do not bring it to serial and then back to parallel; repartition instead.
3. For small processing jobs, serial may be better than parallel.
4. Do not access large files across NFS; use the FTP component.
5. Use an ad hoc MFS to read many serial files in parallel, rather than the Concatenate component.

1. Using phase breaks lets you allocate more memory to individual components and make your
graph run faster.
2. Use a checkpoint after the Sort rather than landing the data to disk.
3. Use the Join and Rollup in-memory feature.
4. Best performance is gained when components can work in memory, controlled by MAX-CORE.
5. MAX-CORE for Sort is calculated from the size of the input data file.
6. For an in-memory Join, the memory needed is equal to the non-driving data size plus overhead.
7. If an in-memory Join cannot fit its non-driving inputs in the provided MAX-CORE, it drops all
the inputs to disk and the in-memory setting does not make sense.

8. Use Rollup and Filter by Expression as early as possible to reduce the number of records.

9. When joining a very small dataset to a very large dataset, it is more efficient to broadcast the
small dataset to the MFS using the Broadcast component, or to use the small file as a lookup.

18)
Where does the index value come from in Normalize, and when the vector elements are not fixed
length, how does the length_of() function in Normalize work?
==> Use length_of(in.vector) in the length function of the Normalize component; for each input record the index then runs from 0 to that length minus 1.

19)
I have a multifile having 8 partitions.
I want to join partition 1 and partition4 data.
Could someone explain the approach for the above requirement?
==> Use a Filter by Expression (FBE) to get partitions 1 and 4 in separate flows, using this_partition() == 1 and this_partition() == 4, then use a Join.
==> I guess you should use this flow: i/p -> FBE (this_partition() == 4) -> JOIN

20)
What is the max-core value? What is the use of max-core?
==> The maximum amount of memory used by the component per partition before it spills to
disk.
It is different for different components; for example,
for Join the max-core value is 64 MB, Rollup 64 MB, Scan 64 MB, Sort 100 MB, and Sort within
Groups 10 MB.

21)
Generate surrogate keys for multifiles
==> ((next_in_sequence() - 1) * number_of_partitions()) + this_partition()
For example, with
next_in_sequence() = 5,
number_of_partitions() = 8,
this_partition() = 4:
((5 - 1) * 8) + 4 => 36
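A quick awk illustration of why the formula gives unique keys across partitions (the partition count and record counts here are made up): each partition generates keys in its own residue class modulo the number of partitions, so no two partitions can ever collide.

awk 'BEGIN {
  parts = 4                                # number_of_partitions()
  for (p = 1; p <= parts; p++)             # this_partition()
    for (seq = 1; seq <= 3; seq++)         # next_in_sequence()
      printf "partition %d, record %d -> key %d\n", p, seq, (seq - 1) * parts + p
}'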

22) Sort data using reformat


==> Yes. It is possible to sort data using reformat.
Try the below example.
Input file ----> Reformat -----> Output file
Input dml:
record string(" ") field; end;
Input data: 8,1,4,2
Reformat:
let string("")[integer(2)] fieldsplit=string_split(in.field,",");
out.sort_field::string_join(vector_sort(fieldsplit),",");
Output dml:
record string(" ") sort_field; end;
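The same idea in plain shell, for comparison (a rough sketch; sort -n sorts numerically, whereas vector_sort above sorts the strings):

# split on commas, sort, and join back into one comma-separated line
echo "8,1,4,2" | tr ',' '\n' | sort -n | paste -s -d',' -
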
23)
How can I achieve a cumulative summary in Ab Initio other than by using the Scan component? Is there
any in-built function available for that?
==> Scan is really the most simple way to achieve this. Another way is to use a ROLLUP, since
it is a multistage component. You need to put the ROLLUP component into multistage format
and write the intermediate results to a temp array (I think they're called vectors in AI). The
ROLLUP loops through each record in your defined group.

Let's say you want to get intermediate results by date. You sort your data by {ID; DATE} first.
Then ROLLUP by {ID}. The ROLLUP will execute its transformation for each record per ID.
So store your results in a temp vector, which will need to be initialized to the size of your
largest group. Each time the ROLLUP enters the transformation, write to the [i] position in the
array and increment i each time. As long as this is all done in the "rollup" transformation and not
the "finalize" transformation, it will run the "initialize" portion before it moves to the next ID.

I have done it this way, but the Scan is easier. I was doing a more simple rollup before I found
that I needed cumulative intermediate results, so I just modified my existing ROLLUP. Ab Initio
documentation does not explain this technique in detail, but it can be done. Let me know if you
need more detail and I can provide a better example.
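For intuition, the cumulative-sum-per-group idea looks like this in awk (a sketch outside Ab Initio, assuming "id amount" records already sorted by id; the file name is made up):

# reset the running total whenever the group key changes, then print each record with its running sum
awk '{ if ($1 != prev) { run = 0; prev = $1 } run += $2; print $1, $2, run }' sorted_input.dat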

24)
I have deptno and sal in the emp table, and I want the 3rd highest salary per dept. Say I have data sorted
like this:

deptno sal
10 4000
10 3000
10 2000
10 1000
20 4000
20 2050

Now, I need the 3rd highest salary in each group, and if a group does not have 3 or more records then I want
to display the highest salary.
==> 1. Use a Sort component to sort on {deptno; sal descending}.
2. Use a Reformat to assign ranks to the salaries within each group, as below:
let string("|") t_dept = "";
let decimal("|") t_sal = 0;
let decimal("|") t_rank = 0;
/* Reformat operation */
out :: reformat(in) =
begin
t_rank = if (t_dept != in.dept) 1 else if (t_sal != in.sal) t_rank + 1 else t_rank;
t_dept = in.dept;
t_sal = in.sal;
out.rank :: t_rank;
out.dept :: in.dept;
out.sal :: in.sal;
end;
3. Then use a Filter by Expression to filter on rank == 3. You can use a parameter through which you
pass the nth highest salary.
==> (GOOD) I guess the code below can help: 3rd max, else 1st max, for each group key.
Input_file --> Sort {dept_no; sal descending} --> Rollup (key {dept_no}) --> output
type temporary_type =
record
decimal("") rank;
decimal("") sal;
end;
temp :: initialize(in) =
begin
temp.rank :: 0;
temp.sal :: in.sal;
end;
temp :: rollup(temp, in) =
begin
temp.rank :: temp.rank + 1;
temp.sal :: if (temp.rank == 1 || temp.rank == 3) in.sal else temp.sal;
end;
out :: finalize(temp, in) =
begin
out.sal :: temp.sal;
out.dept_no :: in.dept_no;
end;

25)
i/p -> sort(key1)->rollup(key1)->sort(key2)->rollup(key2)->normalise->o/p.
==>

26)
If you have one input file with some thousands of records, how do you separate the header and trailer
records from the input file?
==> 1) If the header and trailer are denoted by some special indicator like H or T, you can use a
Filter by Expression with a condition like (rec_type != "H" and rec_type != "T").
2) You can use 2 Dedup components: in the first component use keep = first, and in the next use
keep = last, so you can separate them.

27)
What are the .rec and .job files?
==> Whenever a graph is run, a recovery file is also created, so that after any failure you can
restart from that point. This is the .rec file.
The .job file (a text representation of the metadata, components, and flows of the graph) is a
catalog file created with the same name as the graph and the extension .job; it calls the
launcher, which initiates the process.
==> When we invoke the script for the first time, the jobname is determined from graph or pset or
plan name and the mp job command is invoked to create jobname.job file. So basically the job
files is the text representation of the metadata, components, and flows of the graph.
The mp run command tells the Co>Operating System to read the jobname.job file and
execute the graph.
The mp run command also creates the jobname.rec file, which points to the location of
the internal job-specific files that enable the Co>Operating System to roll back or recover the job
in case it fails.

28)
Could you please explain in detail the differences between multifiles and ad hoc files?
==> A set of partition files treated as a single logical file is known as a multifile. A set of serial
files having the same DML/record format and treated together as partitions at run time is known as an ad hoc multifile.

29)
I have 10 million records in one file. If I develop a graph with this file, it takes a long time to
execute, but I want to decrease the execution time. How would you proceed?
==> You may use the following performance improvement techniques depending on the
situation:
1. You may convert the serial file to multifile system using a partition by key, if it is a serial file.
2. You may filter out all the records from the file that are unwanted for the process. Elimination
of records helps the cause.
3. If there are joins with any tables/files, try to use lookup files for the smaller tables/files.
Also, for joins with bigger tables/files, put the larger file on the driving port.
4. Use the in-memory option for joins involving smaller files.
5. By any chance if you are unloading from a table,you may use order by in the SQL which
eliminates use of Sort component in the graph.

30)
Input file ----100 records.
a) Using a rollup,how many records will you get?
b) How can I get more than 100 records?
c) How can I get 1 record?
==> a. Using Rollup, you can get 100 or fewer records (depending upon the key).
b. Use Normalize to get more than 100 records.
c. Use Rollup with a null key {} to get 1 record.

31)
What is the difference between $ and ${} substitution.
==> ${} substitution is similar to $ substitution, except that the parameter name must be enclosed in
curly braces.
If we talk about these in parameter definitions then -
1. If the interpretation is $ substitution then we can give the value as both $ substitution and ${}
substitution.
e.g. Parameter can be of name $AI_SERIAL or ${AI_SERIAL}
2. If the interpretation is ${} substitution then we can only give the value as ${} substitution
parameter.
e.g. Parameter can only be of name ${AI_SERIAL}

32)
If I pass a null key to the Join component, what will the output be?
I have two tables, customer and transaction,
with 100 rows and 10 columns, and I want to perform the join with a NULL key; how
many output records will I get?
==> If you pass a null key {} to the Join component, you get the Cartesian product (input1
records x input2 records).
33) We have Sort and Sort within Groups components. We can achieve the Sort within Groups
functionality by placing two keys in the Sort key. Then why do we have to go for Sort within Groups?
==> Use Sort within Groups only if records within a group must be processed before being sorted on a
minor key, i.e. if processing "A" needs records sorted on field {"1"} and later in the flow
processing "B" needs records sorted on fields {"1", "2"}. In this case, after processing "A"
and before processing "B", use Sort within Groups with major key {"1"} and minor key {"2"}.
If records are not grouped to start with, use Sort with a multi-part key. There is no benefit to
using Sort within Groups directly after using a Sort component.
==> When the data is already sorted or grouped on the major key, Sort within Groups lets you sort each
group on the minor key without re-sorting the whole dataset. Using a plain Sort again with a compound
key would re-sort all the data, increasing the component's run time and the graph's complexity, so
Sort within Groups is the better choice in that situation.
==> Sort within group is used to sort records on the minor key which is already sorted on major
key.
For Example:
Let us suppose my data is coming like
Dept_Num, Proj_Num, Manager
-------------------------------------------
10, A, Prabha
10, C, Ajit
10, B, Sujata
20, D, Feraz
20, A, Prabha
20, E, Tanmoy
30, A, Prabha
30, C, Ajit
30, B, Sujata
20, D, Feraz
20, K, Munmun
20, E, Tanmoy
When these records are passed through Sort within Groups with major key dept_num and minor key
proj_num, the output will be
10, A, Prabha
10, B, Sujata
10, C, Ajit
20, A, Prabha
20, D, Feraz
20, E, Tanmoy
30, A, Prabha
30, B, Sujata
30, C, Ajit
20, D, Feraz
20, E, Tanmoy
20, K, Munmun

34)
How do we find out whether the schema in our project is a Star or a Snowflake schema?
==> Based on the joins between the fact table and dimension tables: if the dimension tables are further normalized into sub-dimension tables, it is a snowflake schema; otherwise it is a star schema.

35)
What is Air Sandbox in Ab Initio?
==> The air sandbox command has many options, each with a different use.
For example, air sandbox run will run a graph, pset, or plan;
air sandbox show-common will show the included common sandboxes.

36)
Input file has below contents:
Ball Run
1 1
2 1
3 1
4 1
5 1
6 1
1 0
2 1
3 1
4 1
5 1
6 1
1 1
2 1
3 1
4 1
5 1
6 0

Required Output (over and total runs):
1 6
2 5
3 5
==> Just use the below key_change function (e.g. in a Rollup) to solve this:
Code
out :: key_change(in1, in2) =
begin
out :: in2.ball % 6 == 1;
end;
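The same aggregation sketched in awk for comparison (assuming a header line followed by "ball run" pairs, with the ball number restarting at 1 each over):

# start a new over whenever the ball number is 1, and total the runs per over
awk 'NR > 1 { if ($1 == 1) over++; runs[over] += $2; if (over > last) last = over }
     END { for (o = 1; o <= last; o++) print o, runs[o] }' input.dat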

37)
37) How to convert a 4-way MFS to a 2-way MFS? Can anyone tell me clearly how to do it?
==> We can convert a 4-way MFS to a 2-way MFS by repartitioning, i.e. reading the 4-way data and writing it through a partition component whose output flow runs on the 2-way multifile layout.

38)
If we give a NULL key in the Scan component, what will be the output? And in Dedup with the keep
parameter as unique?
==> Scan: with a {} key, Scan treats all the records as one group and gives all the records to the output.
Dedup: with a {} key, keep = first gives the first record, keep = last gives the last record, and
keep = unique-only gives no records (since all the records form one group).

39)
How do we run sequences of jobs, e.g. the output of job A is the input to job B? How do we coordinate
the jobs?
==> A plan is obviously the best option. But we can also make a file listing all the graphs that
need to be executed, in the required sequence, and loop over it in a script, executing each line one
after another and writing the logs to a file.
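A minimal sketch of the wrapper-script alternative (graph_list.txt and run_sequence.log are made-up names): list the deployed .ksh scripts in order, run each one, and stop the chain at the first failure.

# run each deployed graph script in sequence; abort if any of them fails
while read -r graph_script; do
  echo "$(date) starting $graph_script" >> run_sequence.log
  "$graph_script" >> run_sequence.log 2>&1 || { echo "$graph_script failed" >> run_sequence.log; exit 1; }
done < graph_list.txt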

40)
How does the force_error function work? If we set 'never abort' in a Reformat, will force_error stop the
graph or will it continue to process the next set of records?
==> force_error will not abort the graph when we keep the reject threshold as 'never abort'; it rejects
the unwanted records as per the requirement, provides the given message, and continues with the remaining records.

41)
How do you pass the parameters of a graph from the back end?
==> Run the deployed script or the pset from the command line and pass the parameter values there, e.g.:
air sandbox run <path to graph or pset> -<PARAMETER_NAME> <value>

42)
How do you run a graph successfully when a Reformat component in it is failing due to a data
issue?
==> You can set the Reformat's reject threshold to 'never abort', or,
if you know which column the invalid values are coming in, use the is_valid function and
provide some default values.

43) EME Commands


==> 1) air object ls <EME path for the object, e.g. /Projects/edf/...> -- lists the objects in a directory inside the project.
2) air object rm <path> -- removes an object from the repository. Please be careful with this.
3) air object cat <path> -- shows the object that is present in the EME.
4) air object versions -verbose <path> -- gives the version history of the object.
5) air project show <path> -- gives the whole info about the project, e.g. which file types can be checked in.
6) air project modify <path> -extension <something like '*.dat' within single quotes> <content-type> -- modifies the project settings. For example, if you need to check *.java files in to the EME, you may need to add the extension first.
7) air lock show -project <path> -- shows all the files that are locked in the given project.
8) air lock show -user <Unix user ID> -- shows all the files locked by a user in various projects.
9) air sandbox status <file name with the relative path> -- shows the status of the file in the sandbox with respect to the EME (Current, Stale, and Modified are a few of the statuses).

44) tags
==> air tag diff <tag 1> <tag 2>

45) Multifiles ==> Multifile commands


· m_mkfs : m_mkfs <URL of the control partition (the multifile path)> <URL of data partition 1> <URL of data partition 2> ... -- creates a multifile system.
· m_mkdir : m_mkdir url -- url must refer to a pathname within an existing multifile system.
· m_ls
· m_expand
· m_dump
· m_cp
· m_mv
· m_rm
· m_touch
· m_rollback
· m_kill
· m_env
· m_env -ev : displays the currently running version of the Co>Operating System.
· m_eval : executes Ab Initio DML functions/expressions at the Unix prompt.
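For illustration, creating and inspecting a 2-way multifile system with the m_mkfs syntax above might look like this (all paths are hypothetical):

# control directory first, then one URL per data partition
m_mkfs /data/mfs/mfs_2way /data/part0/mfs_2way /data/part1/mfs_2way
m_ls /data/mfs/mfs_2way        # list the contents of the multifile system
m_expand /data/mfs/mfs_2way    # print the data-partition paths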

46) STRING FUNCTIONs :


==> Function
char_string :: Returns a one-character native string that corresponds to the specified character
code.
decimal_lpad :: Returns a decimal string of the specified length or longer,left-padded with a
specified character as needed.
decimal_lrepad :: Returns a decimal string of the specified length or longer,left-padded with a
specified character as needed and trimmed of leading zeros.
decimal_strip :: Returns a decimal from a string that has been trimmed of leading zeros and non-
numeric characters.
ends_with :: Returns 1 (true) if a string ends with the specified suffix;0 (false) otherwise.
is_blank :: Tests whether a string contains only blank characters.
is_bzero :: Tests whether an object is composed of all binary zero bytes.
re_get_match :: Returns the first string in a target string that matches a regular expression.
re_get_matches :: Returns a vector of substrings of a target string that match a regular expression
containing up to nine capturing groups.
re_index :: Returns the index of the first character of a substring of a target string that matches a
specified regular expression.
re_match_replace :: Replaces substrings of a target string that match a specified regular
expression.
re_replace :: Replaces all substrings in a target string that match a specified regular expression.
re_replace_first :: Replaces the first substring in a target string that matches a specified regular
expression.
starts_with :: Returns true if the string starts with the supplied prefix.
string_char :: Returns the character code of a specific character in a string.
string_compare :: Returns a number representing the result of comparing two strings.
string_concat :: Concatenates multiple string arguments and returns a NUL-delimited string.
string_downcase :: Returns a string with any uppercase letters converted to lowercase.
string_filter :: Compares the contents of two strings and returns a string containing characters that
appear in both of them.
string_filter_out :: Returns characters that appear in one string but not in another.
string_index :: Returns the index of the first character of the first occurrence of a string within
another string.
string_is_alphabetic :: Returns 1 if a specified string contains all alphabetic characters, or 0
otherwise.
string_is_numeric :: Returns 1 if a specified string contains all numeric characters, or 0 otherwise.
string_join :: Concatenates vector string elements into a single string.
string_length :: Returns the number of characters in a string.
string_like :: Tests whether a string matches a specified pattern.
string_lpad :: Returns a string of a specified length, left-padded with a given character.
string_lrepad :: Returns a string of a specified length, trimmed of leading and trailing blanks and
left-padded with a given character.
string_lrtrim :: Returns a string trimmed of leading and trailing blank characters.
string_ltrim :: Returns a string trimmed of leading blank characters.
string_pad :: Returns a right-padded string.
string_prefix :: Returns a substring that starts at the beginning of the parent string and is of the
specified length.
string_repad :: Returns a string of a specified length trimmed of any leading and trailing blank
characters, then right-padded with a given character.
string_replace :: Returns a string after replacing one substring with another.
string_replace_first :: Returns a string after replacing the first occurrence of one substring
with another.
string_rindex :: Returns the index of the first character of the last occurrence of a string within
another string.
string_split :: Returns a vector consisting of substrings of a specified string.
string_split_no_empty :: Behaves like string_split, but excludes empty strings from its output.
string_substring :: Returns a substring of a string.
string_trim :: Returns a string trimmed of trailing blank characters.
string_upcase :: Returns a string with any lowercase letters converted to uppercase.
test_characters_all :: Tests a string for the presence of ALL characters in another string.
test_characters_any :: Tests a string for the presence of ANY characters in another string.

- What are the new features in Ab Initio 3.0?

- How to connect a mainframe to Ab Initio?

- How does MAX-CORE work?

- What is a dynamic lookup?

- There are 10,000 records; I loaded 4,000 records today and need to load records 4,001-10,000 the next
day. How is this done in Type 1 and how in Type 2?

- Difference between parameters:


input parameters,
local parameters,
formal parameters,
sandbox parameters,
project parameters,
export parameters?

54) ERRORs :
==>
What does the error message “File table overflow” mean?
This error message indicates that the system-wide limit on open files has been exceeded. Either
there are too many processes running on the system, or the kernel configuration needs to be
changed.
What does the error message “Memory allocation failed (<n>) bytes or Failed to allocate <n>
bytes” mean?
This error message is generated when an Ab Initio process has exceeded its limit for some type of
memory allocation. The actual wording you see depends on your Co>Operating System version,
operating system, and other factors.
Three things can prevent a process from being able to allocate memory:
The user data limit (ulimit -Sd and ulimit -Hd). These settings do not apply to Windows systems.
Address space limit.
The entire computer is out of swap space.

What does the error message “Mismatched straight flow” mean?


This error message appears when you have two components that are connected by a straight flow
and running at different levels of parallelism. A straight flow requires the depths — the number
of partitions in the layouts — to be the same for both the source and destination.

What does the error message “Too many open files” mean?
This error message can occur for several reasons, but occurs most commonly when the value of
the max-core parameter of the SORT component is set too low. In these cases, increasing the
value of the max-core parameter solves the problem.

What does the error message “Trouble writing to socket: No space left on device” mean?
This error message means your work directory (AB_WORK_DIR) is full.

What does the error message “Remote job failed to start up” mean?
This situation typically arises if some of the components in the graph are configured to run on a
remote machine. If the communication between the Co>Operating System (specified in the
Connections dialog) and the remote machine is not set up properly, you could see this error
message.

55) Difference between a phase and checkpoint


==> Phases are used to dedicate resources such as memory, disk space, and CPU cycles to
the most demanding part of the job. Say we have memory-consuming components in a straight
flow and millions of records flowing in: we can separate that processing out into its own phase so
that more resources are allocated to it and the whole process takes less time.

Phasing creates intermediate/temporary files and deletes them as each phase completes, regardless
of whether the graph finally runs successfully or not. Phasing is used for performance tuning and
is useful for avoiding deadlock. The boundary between two blocks is known as a phase break.

On the contrary, checkpoints are required if we need to restart the graph from the last saved
recovery point (a phase break with a checkpoint) when it fails unexpectedly.

Using phase breaks that include checkpoints degrades performance somewhat but ensures a save
point to restart from. Toggling checkpoints can be used to remove checkpoints from a phase
break.

==> The major difference between the two is that phasing deletes the intermediate files made at
the end of each phase as soon as the graph enters the next phase. Checkpointing, on the other hand,
stores these intermediate files until the end of the graph, so we can easily use them to restart the
process from where it failed; this cannot be done with phasing alone.
56) Graph Performance
==> The performance of graphs can be improved by employing the
following methods:
1. Use data parallelism (but efficiently).
2. Try to use fewer phases in graphs.
3. Use component parallelism.
4. Use component folding.
5. Always use a tuned Oracle query inside the Input Table
component; this gives a huge performance improvement.
6. Try to use as few as possible of the components that do
not allow pipeline parallelism.
7. Do not use huge lookups.
8. If the data is not huge, use the in-memory sort option.

57) SCD
==> There are many approaches to dealing with SCDs. The most popular are:

Type 0 - The passive method


Type 1 - Overwriting the old value
Type 2 - Creating a new additional record (effective/expiry date logic)
Type 3 - Adding a new column
Type 4 - Using a historical table

58) SKEW
==> Skew is a measure of how unevenly the data is distributed across the partitions of a flow.

4-way partitioned:
flow 1 --- 200 recs
flow 2 --- 600 recs
flow 3 --- 400 recs
flow 4 --- 800 recs

Average = (200 + 600 + 400 + 800) / 4 = 500


Skew on the 1st flow = (200 - 500) / 800 * 100 = -3/8 * 100 = -37.5% (below average: low skew)
Skew on the 4th flow = (800 - 500) / 800 * 100 = 3/8 * 100 = +37.5% (above average: more skew)

SKEW = (records in the partition - average records per partition) / records in the largest partition * 100
--> a negative value means the partition holds fewer records than average (low skew); a large positive value means it is overloaded (high skew).
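The same calculation in awk, using the partition counts from the example above (a sketch; the divisor is the largest partition's record count, as in the worked numbers):

# skew per partition = (count - average) / largest count * 100
printf '1 200\n2 600\n3 400\n4 800\n' |
awk '{ cnt[$1] = $2; total += $2; if ($2 > max) max = $2 }
     END { avg = total / NR; for (p = 1; p <= NR; p++) printf "partition %d: %.1f%%\n", p, (cnt[p] - avg) / max * 100 }'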

59) Records with EXTRA PIPE (awk unix)


user2@uhadoop01:~/tushar$ cat > 1
AAA|DD|FFF|FF
AA|DD|FF
DD|FF|SS
XX|AA|DD
CC|GG|SS|AA
^C
user2@uhadoop01:~/tushar$ awk -F "|" '{if(NF > 3) print FNR "::" $0;}' 1
1 :: AAA|DD|FFF|FF
5 :: CC|GG|SS|AA
user2@uhadoop01:~/tushar$ awk -F "|" 'BEGIN{OFS=":";} {if(NF > 3) print FNR , $1,$2;}' 1
1:AAA:DD
5:CC:GG

user2@uhadoop01:~/tushar$ ls -l |awk 'BEGIN{sum=0} {sum=sum+$5} END {print sum}'


30405696

user2@uhadoop01:~/tushar$ awk 'BEGIN{RS = "|";} {print $0}' 1


AAA
DD
.
.
GG
SS
AA

ARGC number of command line arguments.


ARGV array of command line arguments, 0..ARGC-1.
FILENAME name of the current input file.
FNR current record number in FILENAME.
FS splits records into fields as a regular expression.
NF number of fields in the current record.
NR current record number in the total input stream.
OFS inserted between fields on output, initially = " ".
ORS terminates each record on output, initially = "\n".
RS input record separator, initially = "\n".

58) JOIN driving PORT


==> But my doubt is: how do you declare an in port as the driving port? Does Ab Initio provide
any icon or other means of doing so, or is the in0 port of the Join considered the driving
port?

2. Also, what about the driving-table concept?


If I am joining two tables, is it better to have two Input Table components, unload them and then join them
using a Join component, or to write the SELECT statement with the join condition in the Input Table
itself?

1) In the Join component, if you specify the sorted-input parameter as "In memory", the
driving parameter automatically appears in the component. You can specify port 0 or 1 or any
other port as the driving port. This option only appears when you perform an
in-memory join. The non-driving inputs are then loaded into memory. Do not assume that port in0
will be your driving port; it is just the default.

2) Regarding point 2, it is always better to perform the join after unloading the records from the
Input Table.
Try to minimize database access as much as possible. Another suggestion: if one
of your tables is very small, you can do the join with the help of the Join with DB component. In
that case unload the larger table and perform the join with the smaller table with the help of the
Join with DB component.
59) EME (Enterprise Meta>Environment)
==> The EME is the Ab Initio repository and environment for storing and managing metadata.
EME metadata can be accessed from the Ab Initio GDE, a web browser, or the
Co>Operating System command line (air commands).

60) RANK() - third largest sales from each store


==> SELECT storeid, sales,
       RANK() OVER (PARTITION BY storeid ORDER BY sales DESC) AS "Ranking"
FROM salestbl
QUALIFY Ranking = 3;

61) QUALIFY() -
==> QUALIFY ROW_NUMBER() OVER (PARTITION BY Emp_Name ORDER BY Emp__NR DESC) = 1

62) Performance improvement when LOADING into a DB

We had a similar scenario wherein we had to update about 200,000+ records in a dimension table
daily, and the updates used to run for about an hour even after tuning the queries and creating
all the required partitions and indexes on the table; it also used to cause a table-locking
problem while updating in parallel.

To overcome this we did the following and observed more than 50% improvement in performance:

1. Introduce an additional column in dimension_table, say ACTIVE_FLAG, rename
dimension_table to dimension_table_base, and create an index on the ACTIVE_FLAG column.
2. Create a view named dimension_table on dimension_table_base that shows only the records
with active flag 'Y'.
3. Instead of updating the records, just update the ACTIVE_FLAG of the old record to 'N' and
insert a new row with the new values.
4. Create a monthly job to delete the records with ACTIVE_FLAG = 'N' and defragment/rebuild
the table and indexes.

You can try out this approach (if the number of updates is high), given that you can make the
database changes and it does not impact the other applications.

Do let me know if you need more help on this.

63) Different keys on the same lookup file (needs R&D)


==> In your current approach, you can create 1 lookup file instead of 2:
i/p --> lookup file 1.
For the other lookup, use the Lookup Template component and provide the DML and
the required key.
In your transform component, define and call your lookups dynamically using
lookup statements such as lookup_load, lookup_identifier, etc.
You can also explore using catalogs for this scenario.

64) Sum of the 1st and last, and of the 2nd and 2nd-last, records in each group
A 200
A 400
A 60
A 3
A 7

o/p :
A 207
A 403

==>
i/p --> rollup --> normalize --> o/p

rollup:

let decimal(2)[int] amt_arr = allocate();


let decimal(2)[2] sum = allocate();

rollup(in) =
begin
amt_arr = accumulation(in.amount);
sum[0] = amt_arr[0] + amt_arr[length_of(amt_arr) - 1];
sum[1] = amt_arr[1] + amt_arr[length_of(amt_arr) - 2];
out.id :: in.id;
out.sum_arr :: sum;
end;

Normalize:

out :: length(in) =
begin
out :: 2;
end;
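A rough awk equivalent of the rollup above (assuming "id amount" records grouped by id and at least two records per group; input.dat is a made-up name):

# collect the amounts per id, then emit first+last and second+second-last
awk '{ n[$1]++; v[$1, n[$1]] = $2 }
     END { for (id in n) { print id, v[id, 1] + v[id, n[id]]; print id, v[id, 2] + v[id, n[id] - 1] } }' input.dat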

65) Abort running graph


==>
To abort a running job from shell:
m_rollback -k graphname.rec
m_kill jobname

==>
To delete .rec file at shell: m_rollback -d graphname.rec
"m_rollback -d" restores datasets to their state at job start and deletes .WORK files
WARNING: rolling back old .rec files will restore output files to old state
Clean up .rec files with m_rollback right away

66) SCD2 logic


==>
UPDATE TABLE :
Update SQL : update ABC_TBL set A=:a,B=:b,C=:c where D=:d;
Insert SQL : insert into ABC_TBL values(:a,:b,:c,:d);
67) I have a file as follows with 4 records
id code name cost
1 4 xxx 24.25
1 5 yyy 20.00
2 8 aaa 10.00
2 9 bbb 20.00

Output:->

1 4 xxxx 44.25
1 5 yyyy 44.25
2 8 aaaa 30.00
2 9 bbbb 30.00

==>
1. Sort the data based on the first field (id).
2. Replicate the data:
a. Flow 1: connect to a Rollup component with id as the key field and sum up the cost using sum(cost).
Connect the out port of the Rollup to the Join component (in0).
b. Flow 2: connect flow 2 to the Join component (in1).
3. Join the above two flows using id as the key.
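The same replicate/rollup/join pattern can be mimicked with a two-pass awk sketch (assuming whitespace-separated "id code name cost" rows with a header line; input.dat is a made-up name):

# pass 1: sum the cost per id; pass 2: print each row with its group total appended
awk 'NR == FNR { if (FNR > 1) total[$1] += $4; next }
     FNR > 1   { print $1, $2, $3, total[$1] }' input.dat input.dat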

68) Column to Row


==> i/p --> rollup(concatenate)--> normalize --> o/p
Use Rollup in expanded mode with {} as the key and declare two fields in the temporary type, initializing
them to the empty string (""). In the rollup transform, concatenate each input field to its corresponding
temporary field. Define a vector of string type in the output DML of the Rollup, and in finalize append
both temporary fields into this vector output field. Then use Normalize to flatten the vector
into multiple records.

68)
Input file
Col1 col2
1 A
2 B
3 C
4 D

and desired output should be like below:


col1 1 2 3 4
col2 A B C D

==> i/p --> Rollup (concatenate, with key {}) --> Meta Pivot --> o/p
The Rollup output is a single record:
Col1 Col2
1234 ABCD
Meta Pivot parameters:
name_field :: col_nm;
value_field :: col_val;

==> i/p --> Meta Pivot --> o/p


name_field :: col_nm;
value_field :: col_val;

o/p :
Col1 1
Col2 A
Col1 2
Col2 B
Col1 3
Col2 C
Col1 4
Col2 D

Year : 2012
GDE : 3.0.4
Co-Op: 3.0.3.9

Year : 2015
GDE 3.2.2
Co>Operating System 3.2.2

S1) Tags :
==> There are two types of tags:
- Regular tags (vtags) : a tag contains a list of objects and their associated version
numbers.
- Technical repository tags (rtags)

S2)
==> Perform the following steps to move your code from the DEV EME to the PROD EME.

1. Save the graphs and deploy them as scripts. This updates your .ksh files.
2. Check in the graphs and other necessary objects (dmls, xfrs, ...). This puts the updated
objects in the EME.
3. Create a tag (say xyz) for the graphs/dmls/xfrs that you want to move to the other EME. Use the
command
air tag create xyz /Projects/DEV/ENV/mp/load_dw.mp
4. Create a .save file using the command:
air promote save <.save file name> <tag name created in the last step>
5. Load the .save file in the PROD / QA EME using the following command:
air promote load <name of .save file created in the last step> -relocate <destination project path>
Putting an example:
air promote load xyz.SAVE -relocate /Projects/PROD/ENV
6. Now check out the graphs and other files like dmls, xfrs, ... from the PROD environment to your
sandbox (if required).

S3) MATERIALIZED View :


==> A VIEW is nothing but a SQL query: it takes the output of a query and makes it appear like a
virtual table, which does not take up any storage space or contain any data.

A MATERIALIZED VIEW, however, is a schema object that stores the result of a query in a separate
schema object (i.e. it takes up storage space and contains data). This means the materialized view
returns a physically separate copy of the table data, and it gets refreshed periodically as specified
in the materialized view creation statement.

==> Materialized view provides indirect access to table data by storing the results of a query in a
separate schema object.
The existence of a materialized view is transparent to SQL, but when used for query
rewrites will improve the performance of SQL execution.
An updatable materialized view lets you insert, update, and delete.
You can define a materialized view on a base table, partitioned table or view and you can
define indexes on a materialized view.
A materialized view can be stored in the same database as its base table(s) or in a
different database.

SNAPSHOT : a materialized view created in a different database from the one where the base table
resides.

CREATE MATERIALIZED VIEW view-name
BUILD [IMMEDIATE | DEFERRED]
    -- IMMEDIATE: the materialized view is populated immediately.
    -- DEFERRED: the materialized view is populated on the first requested refresh.
REFRESH [FAST | COMPLETE | FORCE]
    -- FAST: a fast refresh is attempted.
    -- COMPLETE: the table segment supporting the materialized view is truncated and repopulated completely using the associated query.
    -- FORCE: a fast refresh is attempted; if one is not possible, a complete refresh is performed.
ON [COMMIT | DEMAND]
    -- ON COMMIT: the refresh is triggered by a committed data change in one of the dependent tables.
    -- ON DEMAND: the refresh is initiated by a manual request or a scheduled task.
[[ENABLE | DISABLE] QUERY REWRITE]
[ON PREBUILT TABLE]
AS
SELECT ...;

Ex. CREATE MATERIALIZED VIEW emp_mv
    BUILD IMMEDIATE
    REFRESH FORCE ON DEMAND
    AS SELECT * FROM emp@db1.world;

S4) Create Private/Public Project


Here is the command to create a project xxxxxxx.

for a private project -----> air project create /abinitio/Projects/xxxxxx -prefix AI_


for a public project -----> air project create /abinitio/Projects/xxxxxx -prefix COMMON_

Note: /abinitio/Projects/ is the path where the project resides in the EME. Please use your own path details.

==> 1. Public projects contain interface information, for example metadata in the shape of
DML, dbc connections, etc. - anything to do with an interface.
2. Private projects contain graphs and the application - anything that executes and/or processes
data.
3. Special public projects - these are stdenv and a local-environment project.
4. Obtain the Standard>Environment Guide and Reference and have a read.

Worth noting that when creating public projects via the Standard>Environment you will be
prompted for a project prefix, whilst private projects will use the AI_ prefix. (And for those in
Boston - roll on the Enterprise View as the enhancement to
AB_ALLOW_PROJECT_OVERVIEW; it would make explaining the relationships easier.)

==> There is a special Project associated with every instance of Ab Initio environment known as
the Environment Project or stdenv. This is no different from a regular Project in the structure. It
contains machine and Application specific settings like the data directory mount points, max-core
settings and application wide parameters like current date, which are used across all Projects.
During creation of any Project, stdenv is included in it by default. A single stdenv is required for
an entire set
of applications on a single machine and sharing a single EME Datastore.
=>---------------------------------
mdp-gcde, brdng, lc,
Fraud: receiving an input file from the fraud monitoring system and, as per business requirements, populating
indicators such as CP/CNP fraud (lookup against the CP table, else calculate from other field codes, else CNP);
case table, disputes, adjustments.
Mainframe conversion: understand and discuss the logical flow diagram provided by the COBOL team; once
confirmed, start development, test, and compare results.
=
Third party: extract data from the database and mask it; account-number table; load the table to which the third party has access.
ADC: replace FTP with SFTP.
Analysis:
=>--------------------------------
S5) SET vs MULTISET in Teradata
==>
In Teradata, tables created are of two types: SET and MULTISET.
SET tables are tables that do not allow duplicate rows.

By default, a created table is a SET table.


A SET table creates additional effort for Teradata, which must check whether each inserted record is
a duplicate or not.

Instead, one can create a MULTISET table and use QUALIFY or GROUP BY in queries
in order not to get duplicate records.

On the other hand, if a unique primary index is defined on the table, there is no need to use
a SET table, since the unique primary index will not allow duplicates.

When we use an INSERT INTO ... SELECT FROM statement, the SET table's duplicate-row check
is skipped: duplicate rows are silently discarded and there will not be a DUPLICATE ROW ERROR.

When we use an INSERT INTO ... VALUES statement, the SET table's check for duplicate
rows is not skipped, so when a duplicate row appears it gives a DUPLICATE
ROW ERROR.

S6) UPDATE subquery


==>
UPDATE cust_detail_tbl main
FROM(
SELECT account_nbr, cust_name FROM cust_tran GROUP BY 1,2
)tmp
SET main.cust_nm = tmp.cust_name
WHERE main.acct_nbr = tmp.account_nbr;

==>
UPDATE C
FROM CUSTOMERS C, RESTAURANTS R
SET LIKES_US = 'Y'
WHERE C.LINK_ID = R.LINK_ID
  AND R.REST_TYPE = 'DINER'
  AND C.LIKES_US IS NULL;

S7) Types of VIEWs


==>
Based on the functionality, there are 3 types of views:
Read-only view
Updatable view
Materialized view

S8) UNIX file permission


==>
1 – execute only :: 2 – write only :: 4 – read only
U (user - owner of the file), G (group), O (others)
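For example, 754 combines those bits per class (a quick illustration; the file name is made up):

# 7 = 4+2+1 (rwx) for the user/owner, 5 = 4+1 (r-x) for the group, 4 (r--) for others
chmod 754 run_graph.ksh
ls -l run_graph.ksh    # shows -rwxr-xr-- ...
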
S9) Sort files in a directory alphabetically
==> ls -l|sort -k9

S10) sed Command


==>
Delete all lines except the first (header) line :: sed '1!d' file
Delete all lines except the last (trailer) line :: sed '$!d' file
Delete the first and last lines :: sed '1d;$d' file
Delete empty or blank lines :: sed '/^$/d' file
Delete lines that begin/end with a specified character :: sed '/^u/d' file :: sed '/x$/d' file
Delete lines starting from a pattern till the last line :: sed '/fedora/,$d' file
Delete the last line only if it contains the pattern :: sed '${/ubuntu/d;}' file

S11) find Command


==>
Find files by name, ignoring case :: find -iname "sum.java"
Find a file in all directories recursively :: find -name "1*"
Find the files whose name is not "sum.java" :: find -not -name "sum.java"
Print the files in the current directory and one level down :: find -maxdepth 2 -name "sum.java"
Find the largest file in the current directory and subdirectories :: find . -type f -exec ls -s {} \; | sort -n -r | head -1
Find directories :: find . -type d
Find files larger/smaller than 10M :: find . -size +10M :: find . -size -10M
Find files based on file permissions :: find . -perm 777
Find files modified within the last 30 minutes :: find . -mmin -30 :: find . -amin -60 (accessed within 1 hr) :: find . -cmin -120 (changed within 2 hrs)
Find files changed between two reference files :: find . -cnewer f1 -and ! -cnewer f2
Remove blank lines using grep :: grep -v "^$" file.txt

Sum the 2nd column of odd-numbered and even-numbered lines separately:
awk '{if(NR%2 == 1) {sum_e = sum_e + $2 } else {sum_o = sum_o + $2 }} END { print sum_e,sum_o }' num

S12) Fibonacci series ::


==> awk 'BEGIN {
  for (i = 0; i <= 10; i++) {
    if (i < 1) { prev = 0; curr = 1; print i }
    else { fibb = prev + curr; curr = prev; prev = fibb; print fibb }
  }
}'

S13)
==> Replicate : copies every record on its input to each of its output flows.
It cannot take a serial (SFS) input and write to an MFS output (it does not change the level of parallelism).
Broadcast : can take an SFS input and write to an MFS output.

S14) SnowFlake example:


==> DIM_PRODUCT(ID,Product_name,Brand_id,Product_Catagory_id)
==> DIM_BRAND(Brand_id,Brand_name)
==> DIM_PROD_CATAGORY(Product_Catagory_id,Product_Catagory_Name)

S15) Check out old versions


==> air -version <version number> project export <rpath> -basedir <sandbox path> -file <graph
name>

S16) Why Secondary Index is always 2 AMP?


==>
Unique Secondary Index : CREATE UNIQUE INDEX [COLUMN_NAME] ON
TABLENAME;
Non-Unique Secondary Index: CREATE INDEX [COLUMN_NAME] ON TABLENAME;

Whenever a SECONDARY index is created on table , a subtable is created on all the AMPs
which hold following information:
SECONDARY INDEX VALUE || SECONDARY INDEX ROW_ID || PRIMARY INDEX
ROW_ID

So whenever we query using a column defined as a SECONDARY INDEX, all AMPs are asked to
check their subtable to see if they hold that value. If yes, then the AMPs retrieve the corresponding
PRIMARY INDEX row_id from their subtable. Then the AMP holding that PRIMARY INDEX
row_id is asked to retrieve the respective records. Hence, data retrieval via a secondary index is
always a 2-AMP (or more) operation. For a NUSI [Non-Unique Secondary Index] the subtable row is
created on the same AMP holding the PRIMARY row_id, whereas for a USI [Unique Secondary
Index] the subtables hold information about rows on different AMPs. A secondary index avoids a
FULL TABLE scan; however, one should collect STATS on secondary index columns in order
to allow the Optimizer to use the secondary index rather than a full table scan.

Advantages:
-Avoids FULL Table Scan by providing alternate data retrieval path.
-Enhances performances.
-Can be dropped and created anytime.
-A table may have multiple Secondary Index defined where as only one Primary Index is
permissible.

Disadvantages:
-Needs extra storage space for SUBTABLE.
-Needs extra I/O to maintain SUBTABLE.
-Collect STATS is required in order to avoid FULL TABLE SCAN.

DROP INDEX [COLUMN_NAME] ON TABLENAME;

S17)I/P : 1,2,2,3,3,4,4,5 output : 1,2 \n 2,3 \n 3,4 \n 4,5


==>
arr_Index Values Norm_Index
0 1 0
1 2 0
2 2 1
3 3 1
4 3 2
5 4 2
6 4 3
7 5 3

out :: length(in) =
begin
out :: (length_of(string_split(in.data, ",")) / 2);   /* the normalize transform will run 4 times for this input */
end;

out :: normalize(in, index) =
begin
let decimal("")[int] arr = string_split(in.data, ",");
let decimal("")[int] arr_tmp = allocate();

arr_tmp[0] = arr[index + index];       /* e.g. 1 */
arr_tmp[1] = arr[index + index + 1];   /* e.g. 2 */

out.final :: string_join(arr_tmp, ",");  /* e.g. "1,2" */


end;
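The same pairing in a one-line awk sketch (outside Ab Initio):

# walk the comma-separated fields two at a time, overlapping by one
echo "1,2,2,3,3,4,4,5" | awk -F',' '{ for (i = 1; i < NF; i += 2) print $i "," $(i + 1) }'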

S18) Shared Nothing Architecture of Teradata


==> A shared-nothing architecture (SN) is a distributed computing architecture in which each
node is independent; more specifically, none of the nodes share memory or disk storage.
In Teradata, the concept of AMPs makes it a shared-nothing architecture. The data on one
AMP is available only to that AMP, and any operations relating to that data are handled by that
AMP, making it a shared-nothing architecture.

S19) Teradata Indexes


==> Teradata supports five index types; the main ones are:
UPI & NUPI : required for row distribution and storage. When a row is inserted, its hash
code is calculated using a hashing algorithm and the row is stored on the corresponding AMP.
USI & NUSI : whenever a SECONDARY index is created on a table, a subtable is created
on all the AMPs, holding the following information:
SECONDARY INDEX VALUE || SECONDARY INDEX ROW_ID || PRIMARY INDEX ROW_ID
It is a 2-AMP process: AMP 1 reads the subtable, and AMP 2 reads the base table
using the Primary Index row ID.

S20) OLAP functions in Teradata


==> RANK, CSUM, QUALIFY

S21) Stored Procedure


==> CREATE PROCEDURE InsertSalary (IN in_EmpNo INTEGER, IN in_Gross INTEGER,
IN in_Deduction INTEGER, IN in_NetPay INTEGER, OUT out_EmpName VARCHAR(100)) -- a data type for the OUT parameter is assumed here
BEGIN
INSERT INTO Salary (EmployeeNo, Gross, Deduction, NetPay)
VALUES (:in_EmpNo, :in_Gross, :in_Deduction, :in_NetPay);

SELECT Emp_name INTO out_EmpName FROM emp_details WHERE emp_id = :in_EmpNo;


END;

CALL InsertSalary(105,20000,2000,18000,out_EmpName);
S22) Teradata - Performance Tuning
==> 1.Collect stats 2.Primary Index 3.Secondary Index 4.Load temp tables for sub queries
5.avoid ORDER by

S23) The sed command "i" is used to insert a line before every line in the range or matching the pattern.
The sed command "a" is used to append a line after every line in the range or matching the pattern.
==>
http://www.thegeekstuff.com/2009/11/unix-sed-tutorial-append-insert-replace-and-count-file-lines/?ref=driverlayer.com

sed '3 a\
Cool gadgets and websites' filename.txt ==> add a line after the 3rd line of the file

sed '/Sysadmin/a \
Linux Scripting' filename.txt ==> append a line after every line matching the pattern

sed '$ i\
Website Design' thegeekstuff.txt ==> insert a line before the last line of the file

sed '4 i\
Cool gadgets and websites' thegeekstuff.txt ==> add a line before the 4th line of the file

sed '1 c\
The Geek Stuff' thegeekstuff.txt ==> replace the first line of the file

sed '/Linux Sysadmin/c \
Linux Sysadmin - Scripting' thegeekstuff.txt ==> replace a line which matches the pattern

sed '2 s/Fedora/BSD/' fedora_overview.dat ==> replace "Fedora" with "BSD" on the second line

sed 's/^.//' file ==> remove the 1st character of every line