A practical guide to learning GNU Awk

What is Opensource.com?

Opensource.com publishes stories about creating, adopting, and sharing open source solutions. Visit Opensource.com to learn more about how the open source way is improving technologies, education, business, government, health, law, entertainment, humanitarian efforts, and more.

Submit a story idea: opensource.com/story
Email us: open@opensource.com

Opensource.com is supported by Red Hat.

A Practical Guide to Learning GNU Awk | CC BY-SA 4.0 | Opensource.com

About the authors

Seth Kenlon is an independent multimedia artist, free culture advocate, and UNIX geek. He has worked in the film and computing industry, often at the same time. He is one of the maintainers of the Slackware-based multimedia production project, http://slackermedia.info.

Dave Morriss is a retired IT Manager now contributing to the "Hacker Public Radio" community podcast (http://hackerpublicradio.org) as a podcast host and an administrator.

Robert Young is the Owner and Principal Consultant at Lab Insights, LLC. He has led dozens of laboratory informatics and data management projects over the last 10 years. Robert holds a degree in Cell Biology/Biochemistry and a master's in Bioinformatics.

Contributors

Jim Hall, Lazarus Lazaridis, Dave Neary, Moshe Zadka

Chapters

Learn
  What is awk?
  Getting started with awk, a powerful text-parsing tool
  Fields, records, and variables in awk
  A guide to intermediate awk scripting
  How to use loops in awk
  How to use regular expressions in awk
  4 ways to control the flow of your awk script

Practice
  Advance your awk skills with two easy tutorials
  How to remove duplicate lines from files with awk
  Awk one-liners and scripts to help you sort text files
  A gawk script to convert smart quotes
  Drinking coffee with AWK

Cheat sheet
  GNU awk cheat sheet

What is awk?

awk is known for its robust ability to process and interpret data from text files.

Awk is a programming language and a POSIX [1] specification that originated at AT&T Bell Laboratories in 1977. Its name comes from the initials of its designers: Aho, Weinberger, and Kernighan. awk features user-defined functions, multiple input streams, a rich set of regular expressions, and (in GNU awk) TCP/IP networking access. It's often used to process raw text files, interpreting the data it finds as records and fields to be manipulated by the user.

At its most basic, awk searches files for some unit of text (usually lines terminated with an end-of-line character) containing some user-specified pattern. When a line matches one of the patterns, awk performs some set of user-defined actions on that line, then continues processing input lines until the end of the input files.

awk is used as a command as often as it is used as an interpreted script. One-liners are popular and useful ways of filtering output from files or output streams or as standalone commands. awk even has an interactive mode of sorts because, without input, it acts upon any line the user types into the terminal:

    $ awk '/foo/ { print toupper($0); }'
    This line contains bar.
    This line contains foo.
    THIS LINE CONTAINS FOO.

However, awk is a programming language with user-defined functions, loops, conditionals, flow control, and more.
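As a small illustration of those language features, here is a minimal sketch that combines a user-defined function with a conditional. The shell wrapper label_parity and the awk function parity are hypothetical names chosen for this example; they are not part of awk or of this guide.

```shell
# Hypothetical helper (not from the guide): label each number read from
# standard input as even or odd, using a user-defined awk function.
label_parity() {
    awk '
    # User-defined function: returns "even" or "odd" for an integer n
    function parity(n) {
        if (n % 2 == 0)
            return "even"
        return "odd"
    }
    # Main rule: runs once for every input record (line)
    { print $1, "is", parity($1) }
    '
}

printf '1\n2\n3\n' | label_parity
```

Run as written, this should print "1 is odd", "2 is even", and "3 is odd" on separate lines. It uses only POSIX awk features, so it should behave the same in gawk, mawk, and nawk.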
It's robust enough as a language that it has been used to program a wiki and even (believe it or not) a retargetable assembler for eight-bit microprocessors.

Why use awk?

awk may seem outdated in a world fortunate enough to have Python available by default on several major operating systems, but its longevity is well-earned. In many ways, programs written in awk are different from programs in other languages because awk is data-driven. That is, you describe to awk what data you want to work with and then what you want it to do when such data is found. There are no boilerplate constructors to create, no elaborate class structure to design, no stream objects to create. awk is built for a specific purpose, so there's a lot you can take for granted and allow awk to handle.

What's the difference between awk and gawk?

Awk is an open source POSIX specification, so anyone can (in theory) implement a version of the command and language. On Linux or any system that provides GNU awk [2], the command to invoke awk is gawk, but it's symlinked to the generic command awk. The same is true for systems that provide nawk or mawk or any other variety of awk implementation. Most versions of awk implement the core functionality and functions defined by the POSIX spec, although they may add special new features not present in others. For that reason, there's some risk of learning one implementation and coming to rely on a special feature, but this "problem" is tempered by the fact that most of them are open source, so they usually can be installed as needed.

Learning awk

There are many great resources for learning awk. The GNU awk manual, GAWK: Effective AWK Programming [3], is a definitive guide to the language.
You can find many other tutorials for awk [4] on Opensource.com, including "Getting started with awk, a powerful text-parsing tool." [5]

Links
[1] https://opensource.com/article/19/7/what-posix-richard-stallman-explains
[2] https://www.gnu.org/software/gawk/
[3] https://www.gnu.org/software/gawk/manual/
[4] https://opensource.com/sitewide-search?search_api_views_fulltext=awk
[5] https://opensource.com/article/19/10/intro-awk

Getting started with awk, a powerful text-parsing tool

Let's jump in and start using it.

Awk is a powerful text-parsing tool for Unix and Unix-like systems, but because it has programmed functions that you can use to perform common parsing tasks, it's also considered a programming language. You probably won't be developing your next GUI application with awk, and it likely won't take the place of your default scripting language, but it's a powerful utility for specific tasks.

What those tasks may be is surprisingly diverse. The best way to discover which of your problems might be best solved by awk is to learn awk; you'll be surprised at how awk can help you get more done with a lot less effort.

Awk's basic syntax is:

    awk [options] 'pattern {action}' file

To get started, create this sample file and save it as colours.txt:

    name       color  amount
    apple      red    4
    banana     yellow 6
    strawberry red    3
    grape      purple 10
    apple      green  8
    plum       purple 2
    kiwi       brown  4
    potato     brown  9
    pineapple  yellow 5

This data is separated into columns by one or more spaces. It's common for data that you are analyzing to be organized in some way. It may not always be columns separated by whitespace, or even a comma or semicolon, but especially in log files or data dumps, there's generally a predictable pattern. You can use patterns of data to help awk extract and process the data that you want to focus on.

Printing a column

In awk, the print function displays whatever you specify.
There are many predefined variables you can use, but some of the most common are integers designating columns in a text file. Try it out:

    $ awk '{print $2;}' colours.txt
    color
    red
    yellow
    red
    purple
    green
    purple
    brown
    brown
    yellow

In this case, awk displays the second column, denoted by $2. This is relatively intuitive, so you can probably guess that print $1 displays the first column, print $3 displays the third, and so on.

To display all columns, use $0.

The number after the dollar sign ($) is an expression, so $2 and $(1+1) mean the same thing.

Conditionally selecting columns

The example file you're using is very structured. It has a row that serves as a header, and the columns relate directly to one another. By defining conditional requirements, you can qualify what you want awk to return when looking at this data. For instance, to view items in column 2 that match "yellow" and print the contents of column 1:

    $ awk '$2=="yellow"{print $1}' colours.txt
    banana
    pineapple

Regular expressions work as well. This conditional looks at $2 for approximate matches to the letter p followed by one or more characters, which are in turn followed by the letter p:

    $ awk '$2 ~ /p.+p/ {print $0}' colours.txt
    grape      purple 10
    plum       purple 2

Numbers are interpreted naturally by awk. For instance, to print any row with a third column containing an integer greater than 5:

    $ awk '$3>5 {print $1, $2}' colours.txt
    name color
    banana yellow
    grape purple
    apple green
    potato brown

The header row is printed too: its third field, amount, is not a number, so awk compares it with 5 as a string, and the letter a sorts after the digit 5.

Field separator

By default, awk uses whitespace as the field separator. Not all text files use whitespace to define fields, though.
For example, create a file called colours.csv with this content:

    name,color,amount
    apple,red,4
    banana,yellow,6
    strawberry,red,3
    grape,purple,10
    apple,green,8
    plum,purple,2
    kiwi,brown,4
    potato,brown,9
    pineapple,yellow,5

Awk can treat the data in exactly the same way, as long as you specify which character it should use as the field separator in your command. Use the --field-separator (or just -F for short) option to define the delimiter:

    $ awk -F"," '$2=="yellow" {print $1}' colours.csv
    banana
    pineapple

Saving output

Using output redirection, you can write your results to a file. For example:

    $ awk -F, '$3>5 {print $1, $2}' colours.csv > output.txt

This creates a file with the contents of your awk query.

You can also split a file into multiple files grouped by column data. For example, if you want to split colours.txt into multiple files according to what color appears in each row, you can cause awk to redirect per query by including the redirection in your awk statement. The parentheses around the file name keep the expression unambiguous across awk implementations:

    $ awk '{print > ($2 ".txt")}' colours.txt

This produces files named yellow.txt, red.txt, and so on.

Fields, records, and variables in awk

In the second article in this intro to awk series, learn about fields, records, and some powerful awk variables.

Awk comes in several varieties: There is the original awk, written in 1977 at AT&T Bell Laboratories, and several reimplementations, such as mawk, nawk, and the one that ships with most Linux distributions, GNU awk, or gawk. On most Linux distributions, awk and gawk are synonyms referring to GNU awk, and typing either invokes the same awk command. See the GNU awk user's guide [1] for the full history of awk and gawk.

The first article in this series showed that awk is invoked on the command line with this syntax:

    $ awk [options] 'pattern {action}' inputfile

Awk is the command, and it can take options (such as -F to define the field separator).
The action you want awk to perform is contained in single quotes, at least when it's issued in a terminal. To further emphasize which part of the awk command is the action you want it to take, you can precede your program with the -e option (a GNU awk option, and not required):

    $ awk -F, -e '{print $2;}' colours.csv
    color
    red
    yellow
    [...]

Records and fields

Awk views its input data as a series of records, which are usually newline-delimited lines. In other words, awk generally sees each line in a text file as a new record. Each record contains a series of fields. A field is a component of a record delimited by a field separator.

By default, awk sees whitespace, such as spaces, tabs, and newlines, as indicators of a new field. Specifically, awk treats multiple space separators as one, so this line contains two fields:

    raspberry red

As does this one:

    tuxedo                  black

Other separators are not treated this way. Assuming that the field separator is a comma, the following example record contains three fields, with one probably being zero characters long (assuming a non-printable character isn't hiding in that field):

    a,,b

The awk program

The program part of an awk command consists of a series of rules. Normally, each rule begins on a new line in the program (although this is not mandatory). Each rule consists of a pattern and one or more actions:

    pattern { action }

In a rule, you can define a pattern as a condition to control whether the action will run on a record. Patterns can be simple comparisons, regular expressions, combinations of the two, and more.

For instance, this will print a record only if it contains the word "raspberry":

    $ awk '/raspberry/ { print $0 }' colours.txt
    raspberry red 99

If there is no qualifying pattern, the action is applied to every record.

Also, a rule can consist of only a pattern, in which case the entire record is written as if the action was { print }.
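The two shorthand forms of a rule can be sketched side by side. This is only an illustration: the three sample records and the shell function names (sample_records, pattern_only, action_only) are assumptions made for this example, not part of the guide's files.

```shell
# Sample records are an assumption for illustration, not the guide's file.
sample_records() {
    printf 'apple red 4\nraspberry red 99\nbanana yellow 6\n'
}

# Pattern only: matching records are printed whole, as if by { print }
pattern_only() { awk '/raspberry/'; }

# Action only: no pattern, so the action runs on every record
action_only() { awk '{ print $1 }'; }

sample_records | pattern_only
sample_records | action_only
```

The first pipeline should print only the raspberry record, and the second should print the first field of all three records.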
Awk programs are essentially data-driven in that actions depend on the data, so they are quite a bit different from programs in many other programming languages.

The NF variable

Each field has a variable as a designation, but there are special variables for fields and records, too. The variable NF stores the number of fields awk finds in the current record. This can be printed or used in tests. Here is an example using the text file [2] from the previous article:

    $ awk '{ print $0 " (" NF ")" }' colours.txt
    name       color  amount (3)
    apple      red    4 (3)
    banana     yellow 6 (3)
    [...]

Awk's print function takes a series of arguments (which may be variables or strings) and concatenates them together. This is why, at the end of each line in this example, awk prints the number of fields as an integer enclosed by parentheses.

The NR variable

In addition to counting the fields in each record, awk also counts input records. The record number is held in the variable NR, and it can be used in the same way as any other variable. For example, to print the record number before each line:

    $ awk '{ print NR ": " $0 }' colours.txt
    1: name       color  amount
    2: apple      red    4
    3: banana     yellow 6
    4: raspberry  red    3
    5: grape      purple 10
    [...]

Note that it's acceptable to write this command with no spaces other than the one after print, although it's more difficult for a human to parse:

    $ awk '{print NR": "$0}' colours.txt

The printf() function

For greater flexibility in how the output is formatted, you can use the awk printf() function. This is similar to printf in C, Lua, Bash, and other languages. It takes a format argument followed by a comma-separated list of items. The argument list may be enclosed in parentheses.

    printf format, item1, item2, ...

The format argument (or format string) defines how each of the other arguments will be output.
It uses format specifiers to do this, including %s to output a string and %d to output a decimal number. The following printf statement outputs the record followed by the number of fields in parentheses:

    $ awk '{ printf "%s (%d)\n", $0, NF }' colours.txt
    name       color  amount (3)
    apple      red    4 (3)
    banana     yellow 6 (3)
    [...]

In this example, %s (%d) provides the structure for each line, while $0,NF defines the data to be inserted into the %s and %d positions. Note that, unlike with the print function, no newline is generated without explicit instructions. The escape sequence \n does this.

Awk scripting

All the awk code in this article has been written and executed in an interactive Bash prompt. For more complex programs, it's often easier to place your commands into a file or script. The option -f FILE (not to be confused with -F, which denotes the field separator) may be used to invoke a file containing a program.

For example, here is a simple awk script. Create a file called example1.awk with this content:

    /^a/ {print "A: " $0}
    /^b/ {print "B: " $0}

It's conventional to give such files the extension .awk to make it clear that they hold an awk program. This naming is not mandatory, but it gives file managers and editors (and you) a useful clue about what the file is.

Run the script:

    $ awk -f example1.awk colours.txt
    A: apple      red    4
    B: banana     yellow 6
    A: apple      green  8

A file containing awk instructions can be made into a script by adding a #! line at the top and making it executable. Create a file called example2.awk with these contents:

    #!/usr/bin/awk -f
    #
    # Print all but line 1 with the line number on the front
    #

    NR > 1 {
        printf "%d: %s\n", NR, $0
    }

Arguably, there's no advantage to having just one line in a script, but sometimes it's easier to execute a script than to remember and type even a single line. A script file also provides a good opportunity to document what a command does. Lines starting with the # symbol are comments, which awk ignores.
Grant the file executable permission:

    $ chmod u+x example2.awk

Run the script:

    $ ./example2.awk colours.txt
    2: apple      red    4
    3: banana     yellow 6
    4: raspberry  red    3
    5: grape      purple 10
    [...]

An advantage of placing your awk instructions in a script file is that it's easier to format and edit. While you can write awk on a single line in your terminal, it can get overwhelming when it spans several lines.

Try it

You now know enough about how awk processes your instructions to be able to write a complex awk program. Try writing an awk script with more than one rule and at least one conditional pattern. If you want to try more functions than just print and printf, refer to the gawk manual [3] online.

Here's an idea to get you started:

    #!/usr/bin/awk -f
    #
    # Print each record EXCEPT
    # IF the first field is "raspberry",
    # THEN replace "red" with "pi"

    $1 == "raspberry" {
        gsub(/red/, "pi")
    }

    { print }

Try this script to see what it does, and then try to write your own.

Links
[1] https://www.gnu.org/software/gawk/manual/html_node/History.html#History
[2] https://opensource.com/article/19/10/intro-awk
[3] https://www.gnu.org/software/gawk/manual/

A guide to intermediate awk scripting

Learn how to structure commands into executable scripts.

This article explores awk's capabilities, which are easier to use now that you know how to structure your command into an executable script.

Logical operators and conditionals

You can use the logical operators and (written &&) and or (written ||) to add specificity to your conditionals.

For example, to select and print only records with the string "purple" in the second column and an amount less than five in the third column:

    $2 == "purple" && $3 < 5 {print $1}

If a record has "purple" in column two but a value greater than or equal to 5 in column three, then it is not selected. Similarly, if a record matches column three's requirement but lacks "purple" in column two, it is also not selected.
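To see how the two operators differ, here is a minimal sketch that runs an "and" rule and an "or" rule over the same rows. The sample rows and the shell function names (sample_rows, purple_and_small, purple_or_small) are assumptions made for this illustration; the rows only mimic the layout of colours.txt.

```shell
# Hypothetical sample rows in the same layout as colours.txt (name color amount)
sample_rows() {
    printf 'name color amount\ngrape purple 10\nplum purple 2\nkiwi brown 4\nbanana yellow 6\n'
}

# AND: both conditions must hold for a record to be selected
purple_and_small() {
    awk '$2 == "purple" && $3 < 5 { print $1 }'
}

# OR: either condition is enough; NR > 1 skips the header row
purple_or_small() {
    awk 'NR > 1 && ($2 == "purple" || $3 < 5) { print $1 }'
}

sample_rows | purple_and_small
sample_rows | purple_or_small
```

With these rows, the "and" rule should select only plum, while the "or" rule should select grape, plum, and kiwi. The parentheses in the second rule matter because && binds more tightly than ||.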
Next command

Say you want to select every record in your file where the amount is greater than or equal to eight and print a matching record with two asterisks (**). You also want to flag every record with a value between five (inclusive) and eight with only one asterisk (*). There are a few ways to do this, and one way is to use the next command to instruct awk that after it takes an action, it should stop scanning and proceed to the next record.

Here's an example:

    NR == 1 {
        print $0;
        next;
    }

    $3 >= 8 {
        printf "%s\t%s\n", $0, "**";
        next;
    }

    $3 >= 5 {
        printf "%s\t%s\n", $0, "*";
        next;
    }

    $3 < 5 {
        print $0;
    }

BEGIN command

The BEGIN command lets you print and set variables before awk starts scanning a text file. For instance, you can set the input and output field separators inside your awk script by defining them in a BEGIN statement. This example adapts the simple script from the previous article for a file with fields delimited by commas instead of whitespace:

    #!/usr/bin/awk -f
    #
    # Print each record EXCEPT
    # IF the first field is "raspberry",
    # THEN replace "red" with "pi"

    BEGIN {
        FS=",";
    }

    $1 == "raspberry" {
        gsub(/red/, "pi")
    }

    { print }

END command

The END command is the counterpart of BEGIN: it lets you perform actions after awk completes its scan through the text file you are processing. If you want to print cumulative results of some value in all records, you can do that only after all records have been scanned and processed.

The BEGIN and END commands run only once each. All rules between them run zero or more times on each record. In other words, most of your awk script is a loop that is executed at every new line of the text file you're processing, with the exception of the BEGIN and END rules, which run before and after the loop.

Here is an example that wouldn't be possible without the END command. This script accepts values from the output of the df Unix command and increments two custom variables
(used and available) with each new record:

    #!/usr/bin/awk -f

    # Skip tmpfs filesystems; add up the third and fourth columns of df
    $1 != "tmpfs" {
        used += $3;
        available += $4;
    }

    # df reports 1K blocks, so dividing by 2^20 yields GiB
    END {
        printf "%d GiB used\n%d GiB available\n", used/2^20, available/2^20;
    }

Save the script as total.awk and try it:

    $ df -l | awk -f total.awk

The used and available variables act like variables in many other programming languages. You create them arbitrarily and without declaring their type, and you add values to them at will. As awk loops over the records, the script accumulates the values of the respective columns, and the END rule prints the totals.

Math

As you can probably tell from all the logical operators and casual calculations so far, awk does math quite naturally. This arguably makes it a very useful calculator for your terminal. Instead of struggling to remember the rather unusual syntax of bc, you can just use awk along with its special BEGIN function to avoid the requirement of a file argument:

    $ awk 'BEGIN { print 2*21 }'
    42
    $ awk 'BEGIN { print 8*log(4) }'
    11.0904

Admittedly, that's still a lot of typing for simple (and not so simple) math, but it wouldn't take much effort to write a frontend, which is an exercise for you to explore.

How to use loops in awk

Learn how to use different types of loops to run commands on a record multiple times.

Awk scripts have three main sections: the optional BEGIN and END functions and the functions you write that are executed on each record. In a way, the main body of an awk script is a loop, because the commands in the functions run for each record. However, sometimes you want to run commands on a record more than once, and for that to happen, you must write a loop.

There are several kinds of loops, each serving a unique purpose.

While loop

A while loop tests a condition and performs commands while the test returns true. Once a test returns false, the loop is broken.
    #!/bin/awk -f

    BEGIN {
        # Loop through 1 to 10

        i = 1;
        while (i <= 10) {
            print i, " to the second power is ", i*i;
            i = i + 1;
        }
        exit;
    }

In this simple example, awk prints the square of whatever integer is contained in the variable i. The while (i <= 10) phrase tells awk to perform the loop only as long as the value of i is less than or equal to 10. After the final iteration (while i is 10), the loop ends.

Do while loop

The do while loop performs commands after the keyword do. It performs a test afterward to determine whether the stop condition has been met. The commands are repeated only while the test returns true (that is, the end condition has not been met). If a test fails, the loop is broken because the end condition has been met.

    #!/usr/bin/awk -f

    BEGIN {
        i = 1;
        do {
            print i, " to the second power is ", i*i;
            i = i + 1
        }
        while (i < 10)
        exit;
    }

For loops

There are two kinds of for loops in awk.

One kind of for loop initializes a variable, performs a test, and increments the variable together, performing commands while the test is true.

    #!/bin/awk -f

    BEGIN {
        for (i = 1; i <= 10; i++) {
            print i, " to the second power is ", i*i;
        }
        exit;
    }

Another kind of for loop sets a variable to successive indices of an array, performing a collection of commands for each index. In other words, it uses an array to "collect" data from a record.

This example implements a simplified version of the Unix command uniq. By adding a list of strings into an array called a as a key and incrementing the value each time the same key occurs, you get a count of the number of times a string appears (like the --count option of uniq). If you print the keys of the array, you get every string that appears one or more times.
For example, using the demo file colours.txt (from the previous articles):

    name       color  amount
    apple      red    4
    banana     yellow 6
    raspberry  red    99
    strawberry red    3
    grape      purple 10
    apple      green  8
    plum       purple 2
    kiwi       brown  4
    potato     brown  9
    pineapple  yellow 5

Here is a simple version of uniq -c in awk form:

    #!/usr/bin/awk -f

    NR != 1 {
        a[$2]++
    }

    END {
        for (key in a) {
            print a[key] " " key
        }
    }

The third column of the sample data file contains the number of items listed in the first column. You can use an array and a for loop to tally the items in the third column by color:

    #!/usr/bin/awk -f

    BEGIN {
        FS=" ";
        OFS="\t";
        print("color\tsum");
    }

    NR != 1 {
        a[$2]+=$3;
    }

    END {
        for (b in a) {
            print b, a[b]
        }
    }

As you can see, you are also printing a header row in the BEGIN function (which always happens only once) prior to processing the file.

Loops

Loops are a vital part of any programming language, and awk is no exception. Using loops can help you control how your awk script runs, what information it's able to gather, and how it processes your data. Our next article will cover switch statements, continue, and next.

How to use regular expressions in awk

Use regex to search code using dynamic and complex pattern definitions.

In awk, regular expressions (regex) allow for dynamic and complex pattern definitions. You're not limited to searching for simple strings but also patterns within patterns.

The syntax for using regular expressions to match lines in awk is:

    word ~ /match/

The inverse of that is not matching a pattern:

    word !~ /match/

If you haven't already, create the sample file from our previous article:

    name       color  amount
    apple      red    4
    banana     yellow 6
    strawberry red    3
    raspberry  red    99
    grape      purple 10
    apple      green  8
    plum       purple 2
    kiwi       brown  4
    potato     brown  9
    pineapple  yellow 5

Save the file as colours.txt and run:

    $ awk -e '$1 ~ /p[el]/ {print $0}' colours.txt
    apple      red    4
    grape      purple 10
    apple      green  8
    plum       purple 2
    pineapple  yellow 5

You have selected all records containing the letter p followed by either an e or an l.

Adding an o inside the square brackets creates a new pattern to match:

    $ awk -e '$1 ~ /p[elo]/ {print $0}' colours.txt
    apple      red    4
    grape      purple 10
    apple      green  8
    plum       purple 2
    potato     brown  9
    pineapple  yellow 5

Regular expression basics

Certain characters have special meanings when they're used in regular expressions.

Anchors

  Anchor  Function
  ^       Indicates the beginning of the line
  $       Indicates the end of a line
  \`      Denotes the beginning of a string (GNU awk)
  \'      Denotes the end of a string (GNU awk)
  \y      Marks a word boundary (GNU awk)

For example, this awk command prints any record containing an r character:

    $ awk -e '$1 ~ /r/ {print $0}' colours.txt
    strawberry red    3
    raspberry  red    99
    grape      purple 10

Add a ^ symbol to select only records where r occurs at the beginning of the line:

    $ awk -e '$1 ~ /^r/ {print $0}' colours.txt
    raspberry  red    99

Characters

  Character  Function
  [ad]       Selects a or d
  [a-d]      Selects any character a through d (a, b, c, or d)
  [^a-d]     Selects any character except a through d (e, f, g, h, ...)
  \w         Selects any word character (GNU awk)
  \s         Selects any whitespace character (GNU awk)
  [0-9]      Selects any digit

The capital versions of \w and \s are negations; for example, \S selects any character that is not whitespace. GNU awk has no \d shorthand for digits, so use [0-9] or the POSIX class [[:digit:]] instead.

POSIX [1] regex offers easy mnemonics for character classes:

  POSIX mnemonic  Function
  [:alnum:]       Alphanumeric characters
  [:alpha:]       Alphabetic characters
  [:space:]       Space characters (such as space, tab, and formfeed)
  [:blank:]       Space and tab characters
  [:upper:]       Uppercase alphabetic characters
  [:lower:]       Lowercase alphabetic characters
  [:digit:]       Numeric characters
  [:xdigit:]      Characters that are hexadecimal digits
  [:punct:]       Punctuation characters (i.e., characters that are not letters, digits, control characters, or space characters)
  [:cntrl:]       Control characters
  [:graph:]       Characters that are both printable and visible (e.g., a space is printable but not visible, whereas an a is both)
  [:print:]       Printable characters (i.e., characters that are not control characters)

Quantifiers

  Quantifier  Function
  .           Matches any character
  +           Modifies the preceding set to mean one or more times
  *           Modifies the preceding set to mean zero or more times
  ?           Modifies the preceding set to mean zero or one time
  {n}         Modifies the preceding set to mean exactly n times
  {n,}        Modifies the preceding set to mean n or more times
  {n,m}       Modifies the preceding set to mean between n and m times

Many quantifiers modify the character sets that precede them. For example, . means any character that appears exactly once, but .* means any or no character. Here's an example; look at the regex pattern carefully:

    $ printf "red\nrd\n"
    red
    rd
    $ printf "red\nrd\n" | awk -e '$0 ~ /^r.d/ {print}'
    red
    $ printf "red\nrd\n" | awk -e '$0 ~ /^r.*d/ {print}'
    red
    rd

Similarly, numbers in braces specify the number of times something occurs. To find records in which the color contains two consecutive e characters:

    $ awk -e '$2 ~ /e{2}/ {print $0}' colours.txt
    apple      green  8

Grouped matches

  Quantifier  Function
  (red)       Parentheses indicate that the enclosed letters must appear contiguously
  |           Means or in the context of a grouped match

For instance, the pattern (red) matches the word red and ordered but not any word that contains all three of those letters in another order (such as the word order).

Awk like sed with sub() and gsub()

Awk features several functions that perform find-and-replace actions, much like the Unix command sed. These are functions, just like print and printf, and can be used in awk rules to replace strings with a new string, whether the new string is a string or a variable.

The sub function substitutes the first matched entity (in a record) with a replacement string. For example, if you have this rule in an awk script:

    { sub(/apple/, "nut", $1);
        print $1 }

running it on the example file colours.txt produces this output:

    name
    nut
    banana
    strawberry
    raspberry
    grape
    nut
    plum
    kiwi
    potato
    pinenut

The reason both apple and pineapple were replaced with nut is that both are the first match of their records. If the records were different, then the results could differ:

    $ printf "apple apple\npineapple apple\n" | \
    awk -e 'sub(/apple/, "nut")'
    nut apple
    pinenut apple

The gsub command substitutes all matching items:

    $ printf "apple apple\npineapple apple\n" | \
    awk -e 'gsub(/apple/, "nut")'
    nut nut
    pinenut nut

Gensub

An even more complex version of these functions, called gensub(), is also available in GNU awk.

The gensub function allows you to use the & character to recall the matched text. For example, if you have a file with the word Awk and you want to change it to GNU Awk, you could use this rule:

    { print gensub(/(Awk)/, "GNU &", 1) }

This searches for the group of characters Awk and stores it in memory, represented by the special character &. Then it substitutes the string for GNU &, meaning GNU Awk. The 1 character at the end tells gensub() to replace the first occurrence.

    $ printf "Awk\nAwk is not Awkward" \
    | awk -e ' { print gensub(/(Awk)/, "GNU &", 1) }'
    GNU Awk
    GNU Awk is not Awkward

There's a time and a place

Awk is a powerful tool, and regex are complex. You might think awk is so very powerful that it could easily replace grep and sed and tr and sort [2] and many more, and in a sense, you'd be right. However, awk is just one tool in a toolbox that's overflowing with great options. You have a choice about what you use and when you use it, so don't feel that you have to use one tool for every job great and small.

With that said, awk really is a powerful tool with lots of great functions.
The more you use it, the better you get to know it. Remember its capabilities, and fall back on it occasionally so you can get comfortable with it.

Links
[1] https://opensource.com/article/19/7/what-posix-richard-stallman-explains
[2] https://opensource.com/article/19/10/get-sorted-sort

4 ways to control the flow of your awk script

Learn to use switch statements and the break, continue, and next commands to control awk scripts.

There are many ways to control the flow of an awk script, including loops [1], switch statements and the break, continue, and next commands.

Sample data

Create a sample data set called colours.txt and copy this content into it:

    name       color  amount
    apple      red    4
    banana     yellow 6
    strawberry red    3
    raspberry  red    99
    grape      purple 10
    apple      green  8
    plum       purple 2
    kiwi       brown  4
    potato     brown  9
    pineapple  yellow 5

Switch statements

The switch statement is a feature specific to GNU awk, so you can only use it with gawk. If your system or your target system doesn't have gawk, then you should not use a switch statement.

The switch statement in gawk is similar to the one in C and many other languages. The syntax is:

    switch (expression) {
        case VALUE:
            <do something here>
        [...]
        default:
            <do something here>
    }

The expression part can be any awk expression that returns a numeric or string result. The VALUE part (after the word case) is a numeric or string constant or a regular expression.

When a switch statement runs, the expression is evaluated, and the result is matched against each case value. If there's a match, then the code contained within a case definition is executed. If there's no match in any case definition, then the default statement is executed.

The keyword break is at the end of the code in each case definition to leave the switch statement. Without break, execution falls through into the code of the case definitions that follow.

Here's an example switch statement:

    #!/usr/bin/awk -f
    #
    # Example of the use of 'switch' in GNU Awk.
    NR > 1 {
        printf "The %s is classified as: ", $1

        switch ($1) {
            case "apple":
                print "a fruit, pome"
                break
            case "banana":
            case "grape":
            case "kiwi":
                print "a fruit, berry"
                break
            case "raspberry":
                print "a computer, pi"
                break
            case "plum":
                print "a fruit, drupe"
                break
            case "pineapple":
                print "a fruit, fused berries (syncarp)"
                break
            case "potato":
                print "a vegetable, tuber"
                break
            default:
                print "[unclassified]"
        }
    }

This script notably ignores the first line of the file, which in the case of the sample data is just a header. It does this by operating only on records with an index number greater than 1. On every record after the header, this script compares the contents of the first field ($1, as you know from previous articles) to the value of each case definition. If there's a match, the print function is used to print the botanical classification of the entry. If there are no matches, then the default instance prints "[unclassified]".

The banana, grape, and kiwi are all botanically classified as a berry, so there are three case definitions associated with one print result.

Run the script on the colours.txt sample file, and you should get this:

    The apple is classified as: a fruit, pome
    The banana is classified as: a fruit, berry
    The strawberry is classified as: [unclassified]
    The raspberry is classified as: a computer, pi
    The grape is classified as: a fruit, berry
    The apple is classified as: a fruit, pome
    The plum is classified as: a fruit, drupe
    The kiwi is classified as: a fruit, berry
    The potato is classified as: a vegetable, tuber
    The pineapple is classified as: a fruit, fused berries (syncarp)

Break

The break statement is mainly used for the early termination of a for, while, or do-while loop or a switch statement. In a loop, break is often used where it's not possible to determine the number of iterations of the loop beforehand.
Invoking break terminates the enclosing loop (which is relevant when there are nested loops or loops within loops).

This example, straight out of the GNU awk manual [2], shows a method of finding the smallest divisor. Read the additional comments for a clear understanding of how the code works:

    #!/usr/bin/awk -f
    {
        num = $1

        # Make an infinite FOR loop
        for (divisor = 2; ; divisor++) {

            # If num is divisible by divisor, then break
            if (num % divisor == 0) {
                printf "Smallest divisor of %d is %d\n", num, divisor
                break
            }

            # If divisor has gotten too large, the number
            # has no divisor, so it is prime
            if (divisor * divisor > num) {
                printf "%d is prime\n", num
                break
            }
        }
    }

Try running the script to see its results:

    $ echo 67 | ./divisor.awk
    67 is prime
    $ echo 69 | ./divisor.awk
    Smallest divisor of 69 is 3

As you can see, even though the script starts out with an explicit infinite loop with no end condition, the break statement ensures that the script eventually terminates.

Continue

The continue statement is similar to break. It can be used in a for, while, or do-while loop (it's not relevant to a switch statement, though). Invoking continue skips the rest of the enclosing loop and begins the next cycle.

Here's another good example from the GNU awk manual to demonstrate a possible use of continue:

    #!/usr/bin/awk -f
    BEGIN {
        for (x = 0; x <= 20; x++) {
            if (x == 5)
                continue
            printf "%d ", x
        }
        print ""
    }

This script analyzes the value of x before printing anything. If the value is exactly 5, then continue is invoked, causing the printf line to be skipped, but leaves the loop unbroken. Try the same code but with break instead to see the difference.

Next

This statement is not related to loops like break and continue are. Instead, next applies to the main record processing cycle of awk: the rules you place between the BEGIN and END rules. The next statement causes awk to stop processing the current input record and to move to the next one.
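A minimal sketch of that behavior, assuming a throwaway three-line input piped in with printf (the data and the function name skip_header are illustrative, not from the article): the first rule calls next on record 1, so the second rule never sees the header.

```shell
# Hypothetical helper: print every record except the first (header) line.
# "next" ends processing of the current record, so the print rule is
# never reached when NR == 1.
skip_header() {
    awk 'NR == 1 { next } { print NR ": " $0 }' "$@"
}

printf 'name color\napple red\nplum purple\n' | skip_header
```

Only records 2 and 3 are printed; removing the first rule would print the header line as well.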
As you know from the earlier articles in this series, awk reads records from its input stream and applies rules to them. The next statement stops the execution of rules for the current record and moves to the next one.

Here's an example of next being used to "hold" information upon a specific condition:

    #!/usr/bin/awk -f

    # Ignore the header
    NR == 1 { next }

    # If field 2 (colour) is less than 6
    # characters, then save it with its
    # line number and skip it
    length($2) < 6 {
        skip[NR] = $0
        next
    }

    # It's not the header and
    # the colour name is at least 6 characters,
    # so print the line
    {
        print
    }

    # At the end, show what was skipped
    END {
        printf "\nSkipped:\n"
        for (n in skip)
            print n": "skip[n]
    }

This sample uses next in the first rule to avoid the first line of the file, which is a header row. The second rule skips lines when the color name is less than six characters long, but it also saves that line in an array called skip, using the line number as the key (also known as the index).

The third rule prints anything it sees, but it is not invoked if either rule 1 or rule 2 causes it to be skipped.

Finally, at the end of all the processing, the END rule prints the contents of the array.

Run the sample script on the colours.txt file from above (and previous articles):

    $ ./next.awk colours.txt
    banana yellow 6
    grape purple 10
    plum purple 2
    pineapple yellow 5

    Skipped:
    2: apple red 4
    4: strawberry red 3
    6: apple green 8
    8: kiwi brown 4
    9: potato brown 9

Control freak

In summary, switch, continue, next, and break are important preemptive exceptions to awk rules that provide greater control of your script. You don't have to use them directly; often, you can gain the same logic through other means, but they're great conveniences that make the coder's life a lot easier. The next article in this series covers the printf statement.

Links
[1] https://opensource.com/article/19/11/loops-awk
[2] https://www.gnu.org/software/gawk/manual/
Advance your awk skills with two easy tutorials

Go beyond one-line awk scripts with mail merge and word counting.

Awk is one of the oldest tools in the Unix and Linux user's toolbox. Created in the 1970s by Alfred Aho, Peter Weinberger, and Brian Kernighan (the A, W, and K of the tool's name), awk was created for complex processing of text streams. It is a companion tool to sed, the stream editor, which is designed for line-by-line processing of text files. Awk allows more complex structured programs and is a complete programming language.

This article will explain how to use awk for more structured and complex tasks, including a simple mail merge application.

Awk program structure

An awk script is made up of functional blocks surrounded by {} (curly brackets). There are two special function blocks, BEGIN and END, that execute before processing the first line of the input stream and after the last line is processed. In between, blocks have the format:

    pattern { action statements }

Each block executes when the line in the input buffer matches the pattern. If no pattern is included, the function block executes on every line of the input stream.

Also, the following syntax can be used to define functions in awk that can be called from any block:

    function name(parameter list) { statements }

This combination of pattern-matching blocks and functions allows the developer to structure awk programs for reuse and readability.

How awk processes text streams

Awk reads text from its input file or stream one line at a time and uses a field separator to parse it into a number of fields. In awk terminology, the current buffer is a record.
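To make the structure above concrete, here is a small sketch, assuming made-up input lines and a hypothetical function name shout. It combines a BEGIN block, one pattern-matching block, a user-defined function, and an END block, and relies on awk's built-in NR (record count) and NF (field count) variables.

```shell
# Illustrative only: the input text and the function name "shout" are
# assumptions, not part of the article.
structure_demo() {
    awk '
    function shout(s) { return toupper(s) "!" }   # callable from any block
    BEGIN { print "start" }                       # runs before the first record
    /awk/ { print NR ": " shout($1) " (" NF " fields)" }   # matching records only
    END   { print "records read: " NR }           # runs after the last record
    ' "$@"
}

printf 'awk is old\nsed is older\nawk has fields too\n' | structure_demo
```

The second input line does not match /awk/, so only records 1 and 3 reach the middle block, while BEGIN and END each run exactly once.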
There are a number of special variables that affect how awk reads and processes a file:

• FS (field separator): By default, this is any whitespace (spaces or tabs)
• RS (record separator): By default, a newline (\n)
• NF (number of fields): When awk parses a line, this variable is set to the number of fields that have been parsed
• $0: The current record
• $1, $2, $3, etc.: The first, second, third, etc. field from the current record
• NR (number of records): The number of records that have been parsed so far by the awk script

There are many other variables that affect awk's behavior, but this is enough to start with.

Awk one-liners

For a tool so powerful, it's interesting that most of awk's usage is basic one-liners. Perhaps the most common awk program prints selected fields from an input line from a CSV file, log file, etc. For example, the following one-liner prints a list of usernames from /etc/passwd:

    awk -F":" '{print $1 }' /etc/passwd

As mentioned above, $1 is the first field in the current record. The -F option sets the FS variable to the character :.

The field separator can also be set in a BEGIN function block:

    awk 'BEGIN { FS=":" } {print $1 }' /etc/passwd

In the following example, every user whose shell is not /sbin/nologin can be printed by preceding the block with a pattern match:

    awk 'BEGIN { FS=":" } ! /\/sbin\/nologin/ {print $1 }' /etc/passwd

Advanced awk: Mail merge

Now that you have some of the basics, try delving deeper into awk with a more structured example: creating a mail merge.

A mail merge uses two files, one (called in this example email_template.txt) containing a template for an email you want to send:

    From: Program committee
    To: {firstname} {lastname}
    Subject: Your presentation proposal

    Dear {firstname},

    Thank you for your presentation proposal:
      {title}

    We are pleased to inform you that your proposal has been successful!
    We will contact you shortly with further information about
    the event schedule.

    Thank you,
    The Program Committee

And the other is a CSV file (called proposals.csv) with the people you want to send the email to:

    firstname,lastname,email,title
    Harry,Potter,hpotter@hogwarts.edu,"Defeating your nemesis in 3 easy steps"
    Jack,Reacher,reacher@covert.mil,"Hand-to-hand combat for beginners"
    Mickey,Mouse,mmouse@disney.com,"Surviving public speaking with a squeaky voice"
    Santa,Claus,sclaus@northpole.org,

You want to read the CSV file (skipping its first line), replace the relevant fields in the template file, then write the result to a file called acceptanceN.txt, incrementing N for each line you parse.

Write the awk program in a file called mail_merge.awk. Statements are separated by ; in awk scripts. The first task is to set the field separator variable and a couple of other variables the script needs. You also need to read and discard the first line in the CSV, or a file will be created starting with Dear firstname. To do this, use the special function getline and reset the record counter to 0 after reading it.

    BEGIN {
      FS=",";
      template="email_template.txt";
      output="acceptance";
      getline;
      NR=0;
    }

The main function is very straightforward: for each line processed, a variable is set for each of the various fields: firstname, lastname, email, and title. The template file is read line by line, and the function sub is used to substitute any occurrence of the special character sequences with the value of the relevant variable. Then the line, with any substitutions made, is output to the output file.

Since you are dealing with the template file and a different output file for each line, you need to clean up and close the file handles for these files before processing the next record.
    {
      # Read relevant fields from input file
      firstname=$1;
      lastname=$2;
      email=$3;
      title=$4;

      # Set output filename
      outfile=(output NR ".txt");

      # Read a line from template, replace special
      # fields, and print result to output file
      while ( (getline ln < template) > 0 )
      {
        sub(/{firstname}/,firstname,ln);
        sub(/{lastname}/,lastname,ln);
        sub(/{email}/,email,ln);
        sub(/{title}/,title,ln);
        print(ln) > outfile;
      }

      # Close template and output file in advance of next record
      close(outfile);
      close(template);
    }

You're done! Run the script on the command line with:

    awk -f mail_merge.awk proposals.csv

or

    awk -f mail_merge.awk < proposals.csv

and you will find text files generated in the current directory.

Advanced awk: Word frequency count

One of the most powerful features in awk is the associative array. In most programming languages, array entries are typically indexed by a number, but in awk, arrays are referenced by a key string. For example, you could store an entry from the file proposals.csv from the previous section in a single associative array, like this:

    proposer["firstname"]=$1;
    proposer["lastname"]=$2;
    proposer["email"]=$3;
    proposer["title"]=$4;

This makes text processing very easy. A simple program that uses this concept is a word frequency counter. You can parse a file, break out words (ignoring punctuation) in each line, increment the counter for each word in the line, then output the top 20 words that occur in the text.

First, in a file called wordcount.awk, set the field separator to a regular expression that includes whitespace and punctuation:

    BEGIN {
      # ignore 1 or more consecutive occurrences of the characters
      # in the character group below
      FS="[ .,:;()<>{}@!\"'\t]+";
    }

Next, the main loop function will iterate over each field, ignoring any empty fields (which happens if there is punctuation at the end of a line), and increment the word count for the words in the line.
    {
      for (i = 1; i <= NF; i++) {
        if ($i != "") {
          words[$i]++;
        }
      }
    }

Finally, after the text is processed, use the END function to print the contents of the array, then use awk's capability of piping output into a shell command to do a numerical sort and print the 20 most frequently occurring words:

    END {
      sort_head = "sort -k2 -nr | head -n 20";
      for (word in words) {
        printf "%s\t%d\n", word, words[word] | sort_head;
      }
      close (sort_head);
    }

Running this script on an earlier draft of this article produced this output:

    [dneary@dhcp-49-32.bos.redhat.com]$ awk -f wordcount.awk < awk_article.txt
    the       79
    awk       41
    a         39
    and       33
    of        32
    in        27
    to        26
    is        25
    line      23
    for       23
    will      22
    file      21
    we        16
    We        15
    with      12
    which     12
    by        12
    this      11
    output    11
    function  11

What's next?

If you want to learn more about awk programming, I strongly recommend the book Sed and awk [1] by Dale Dougherty and Arnold Robbins.

One of the keys to progressing in awk programming is mastering "extended regular expressions." Awk offers several powerful additions to the sed regular expression [2] syntax you may already be familiar with.

Another great resource for learning awk is the GNU awk user guide [3]. It has a full reference for awk's built-in function library, as well as lots of examples of simple and complex awk scripts.

Links
[1] https://www.amazon.com/sed-awk-Dale-Dougherty/dp/1565922255/book
[2] https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions
[3] https://www.gnu.org/software/gawk/manual/gawk.html

How to remove duplicate lines from files with awk

Learn how to use awk '!visited[$0]++' without sorting or changing their order.
Suppose you have a text file and you need to remove all of its duplicate lines.

TL;DR

To remove the duplicate lines while preserving their order in the file, use:

    awk '!visited[$0]++' your_file > deduplicated_file

How it works

The script keeps an associative array with indices equal to the unique lines of the file and values equal to their occurrences. For each line of the file, if the line occurrences are zero, then it increases them by one and prints the line; otherwise, it just increases the occurrences without printing the line.

I was not familiar with awk, and I wanted to understand how this can be accomplished with such a short script (awkward). I did my research, and here is what is going on:

• The awk "script" !visited[$0]++ is executed for each line of the input file.
• visited[] is a variable of type associative array [1] (a.k.a. Map [2]). We don't have to initialize it because awk will do it the first time we access it.
• The $0 variable holds the contents of the line currently being processed.
• visited[$0] accesses the value stored in the map with a key equal to $0 (the line being processed), a.k.a. the occurrences (which we set below).
• The ! negates the occurrences' value:
  • In awk, any nonzero numeric value or any nonempty string value is true [3].
  • By default, variables are initialized to the empty string [4], which is zero if converted to a number.
  • That being said:
    • If visited[$0] returns a number greater than zero, this negation is resolved to false.
    • If visited[$0] returns a number equal to zero or an empty string, this negation is resolved to true.
• The ++ operation increases the variable's value (visited[$0]) by one.
  • If the value is empty, awk converts it to 0 (number) automatically and then it gets increased.
  • Note: The operation is executed after we access the variable's value.
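The evaluation order described in the list above can be traced with a small sketch, assuming a made-up four-line input. It prints the stored count before the increment, the result of the negation, and the count afterwards; the names trace_dedupe, before, and keep exist only for this illustration.

```shell
# Illustrative trace of !visited[$0]++ (the input data is made up).
trace_dedupe() {
    awk '{
        before = visited[$0] + 0    # count so far (empty string becomes 0)
        keep = !visited[$0]++       # negation sees the old value, then ++ runs
        print $0, "before=" before, "keep=" keep, "after=" visited[$0]
    }' "$@"
}

printf 'A\nB\nA\nA\n' | trace_dedupe
```

The first A and the B report keep=1, and the two repeated A lines report keep=0, which is why only the first occurrence of each line is printed by the one-liner.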
Summing up, the whole expression evaluates to:

• true if the occurrences are zero/empty string
• false if the occurrences are greater than zero

awk statements consist of a pattern-expression and an associated action [5].

    <pattern/expression> { <action> }

If the pattern succeeds, then the associated action is executed. If we don't provide an action, awk, by default, prints the input.

An omitted action is equivalent to { print $0 }.

Our script consists of one awk statement with an expression, omitting the action. So this:

    awk '!visited[$0]++' your_file > deduplicated_file

is equivalent to this:

    awk '!visited[$0]++ { print $0 }' your_file > deduplicated_file

For every line of the file, if the expression succeeds, the line is printed to the output. Otherwise, the action is not executed, and nothing is printed.

Why not use the uniq command?

The uniq command removes only the adjacent duplicate lines. Here's a demonstration:

    $ cat test.txt
    A
    A
    A
    B
    B
    B
    A
    A
    C
    C
    C
    B
    B
    A
    $ uniq < test.txt
    A
    B
    A
    C
    B
    A

Other approaches

Using the sort command

We can also use the following sort [6] command to remove the duplicate lines, but the line order is not preserved.

    sort -u your_file > sorted_deduplicated_file

Using cat, sort, and cut

The previous approach would produce a de-duplicated file whose lines would be sorted based on the contents. Piping a bunch of commands [7] can overcome this issue:

    cat -n your_file | sort -uk2 | sort -nk1 | cut -f2-

How it works

Suppose we have the following file:

    abc
    ghi
    abc
    def
    xyz
    def
    ghi
    klm

cat -n test.txt prepends the order number in each line.

    1 abc
    2 ghi
    3 abc
    4 def
    5 xyz
    6 def
    7 ghi
    8 klm

sort -uk2 sorts the lines based on the second column (k2 option) and keeps only the first occurrence of the lines with the same second column value (u option).
    1 abc
    4 def
    2 ghi
    8 klm
    5 xyz

sort -nk1 sorts the lines based on their first column (k1 option) treating the column as a number (-n option).

    1 abc
    2 ghi
    4 def
    5 xyz
    8 klm

Finally, cut -f2- prints each line starting from the second column until its end (-f2- option: Note the - suffix, which instructs it to include the rest of the line).

    abc
    ghi
    def
    xyz
    klm

References

• The GNU awk user's guide
• Arrays in awk
• Awk: Truth values
• Awk expressions
• How can I delete duplicate lines in a file in Unix?
• Remove duplicate lines without sorting [duplicate]
• How does awk '!a[$0]++' work?

Links
[1] http://kirste.userpage.fu-berlin.de/chemnet/use/info/gawk/gawk_12.html
[2] https://en.wikipedia.org/wiki/Associative_array
[3] https://www.gnu.org/software/gawk/manual/html_node/Truth-Values.html
[4] https://ftp.gnu.org/old-gnu/Manuals/gawk-3.0.3/html_chapter/gawk_8.html
[5] http://kirste.userpage.fu-berlin.de/chemnet/use/info/gawk/gawk_9.html
[6] http://man7.org/linux/man-pages/man1/sort.1.html
[7] https://stackoverflow.com/a/20639730/2282448

Awk one-liners and scripts to help you sort text files

Awk is a powerful tool for doing tasks that might otherwise be left to other common utilities, including sort.

Awk is the ubiquitous Unix command for scanning and processing text containing predictable patterns. However, because it features functions, it's also justifiably called a programming language.

Confusingly, there is more than one awk. (Or, if you believe there can be only one, then there are several clones.) There's awk, the original program written by Aho, Weinberger, and Kernighan, and then there's nawk, mawk, and the GNU version, gawk. The GNU version of awk is a highly portable, free software version of the utility with several unique features, so this article is about GNU awk.
While its official name is gawk, on GNU/Linux systems it's aliased to awk and serves as the default version of that command. On other systems that don't ship with GNU awk, you must install and refer to it as gawk, rather than awk. This article uses the terms awk and gawk interchangeably.

Being both a command and a programming language makes awk a powerful tool for tasks that might otherwise be left to sort, cut, uniq, and other common utilities. Luckily, there's lots of room in open source for redundancy, so if you're faced with the question of whether or not to use awk, the answer is probably a solid "maybe."

The beauty of awk's flexibility is that if you've already committed to using awk for a task, then you can probably stay in awk no matter what comes up along the way. This includes the eternal need to sort data in a way other than the order it was delivered to you.

Sample set

Before exploring awk's sorting methods, generate a sample dataset to use. Keep it simple so that you don't get distracted by edge cases and unintended complexity. This is the sample set this article uses:

    Aptenodytes;forsteri;Miller,JF;1778;Emperor
    Pygoscelis;papua;Wagler;1832;Gentoo
    Eudyptula;minor;Bonaparte;1867;Little Blue
    Spheniscus;demersus;Brisson;1760;African
    Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
    Eudyptes;chrysocome;Viellot;1816;Southern Rockhopper
    Torvaldis;linux;Ewing,L;1996;Tux

It's a small dataset, but it offers a good variety of data types:

• A genus and species name, which are associated with one another but considered separate
• A surname, sometimes with first initials after a comma
• An integer representing a date
• An arbitrary term
• All fields separated by semicolons

Depending on your educational background, you may consider this a 2D array or a table or just a line-delimited collection of data. How you think of it is up to you, because awk doesn't expect anything more than text. It's up to you to tell awk how you want
to parse it.

The sort cheat

If you just want to sort a text dataset by a specific, definable field (think of a "cell" in a spreadsheet), then you can use the sort command [1].

Fields and records

Regardless of the format of your input, you must find patterns in it so that you can focus on the parts of the data that are important to you. In this example, the data is delimited by two factors: lines and fields. Each new line represents a new record, as you would likely see in a spreadsheet or database dump. Within each line, there are distinct fields (think of them as cells in a spreadsheet) that are separated by semicolons (;).

Awk processes one record at a time, so while you're structuring the instructions you will give to awk, you can focus on just one line. Establish what you want to do with one line, then test it (either mentally or with awk) on the next line and a few more. You'll end up with a good hypothesis on what your awk script must do in order to provide you with the data structure you want.

In this case, it's easy to see that each field is separated by a semicolon. For simplicity's sake, assume you want to sort the list by the very first field of each line.

Before you can sort, you must be able to focus awk on just the first field of each line, so that's the first step. The syntax of an awk command in a terminal is awk, followed by relevant options, followed by your awk command, and ending with the file of data you want to process.

    $ awk --field-separator=";" '{print $1;}' penguins.list
    Aptenodytes
    Pygoscelis
    Eudyptula
    Spheniscus
    Megadyptes
    Eudyptes
    Torvaldis

Because the field separator is a character that has special meaning to the Bash shell, you must enclose the semicolon in quotes or precede it with a backslash. This command is useful only to prove that you can focus on a specific field.
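The quoting rule above can be checked with a short sketch. It assumes the portable -F option rather than GNU awk's long --field-separator form, and feeds awk a single record from the sample set through a shell variable named line (a name chosen only for this example). All three spellings hand the same semicolon to awk.

```shell
# Three equivalent ways to get a semicolon separator past the shell.
# -F is the POSIX spelling of the field-separator option.
line='Torvaldis;linux;Ewing,L;1996;Tux'

echo "$line" | awk -F';' '{ print $1 }'    # single quotes
echo "$line" | awk -F";" '{ print $1 }'    # double quotes
echo "$line" | awk -F\; '{ print $1 }'     # backslash escape
```

Each command prints the first field of the record. Leaving the semicolon unprotected would instead end the shell command at that point, so awk would never see its program.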
You can try the same command using the number of another field to view the contents of another "column" of your data:

    $ awk --field-separator=";" '{print $3;}' penguins.list
    Miller,JF
    Wagler
    Bonaparte
    Brisson
    Milne-Edwards
    Viellot
    Ewing,L

Nothing has been sorted yet, but this is good groundwork.

Scripting

Awk is more than just a command; it's a programming language with indices and arrays and functions. That's significant because it means you can grab a list of fields you want to sort by, store the list in memory, process it, and then print the resulting data. For a complex series of actions such as this, it's easier to work in a text file, so create a new file called sorter.awk and enter this text:

    #!/usr/bin/awk -f

    BEGIN {
        FS=";";
    }

This establishes the file as an awk script that executes the lines contained in the file.

The BEGIN statement is a special setup function provided by awk for tasks that need to occur only once. Defining the built-in variable FS, which stands for field separator and is the same value you set in your awk command with --field-separator, only needs to happen once, so it's included in the BEGIN statement.

Arrays in awk

You already know how to gather the values of a specific field by using the $ notation along with the field number, but in this case, you need to store it in an array rather than print it to the terminal. This is done with an awk array. The important thing about an awk array is that it contains keys and values. Imagine an array about this article; it would look something like this: author:"seth",title:"How to sort with awk",length:1200. Elements like author and title and length are keys, with the following contents being values.

The advantage to this in the context of sorting is that you can assign any field as the key and any record as the value, and then use the built-in awk function asorti() (sort by index) to sort by the key.
For now, assume arbitrarily that you only want to sort by the second field.

Awk statements not preceded by the special keywords BEGIN or END are loops that happen at each record. This is the part of the script that scans the data for patterns and processes it accordingly. Each time awk turns its attention to a record, statements in {} (unless preceded by BEGIN or END) are executed.

To add a key and value to an array, create a variable (in this example script, I call it ARRAY, which isn't terribly original, but very clear) containing an array, and then assign it a key in brackets and a value with an equals sign (=).

    {   # dump each field into an array
        ARRAY[$2] = $0;
    }

In this statement, the contents of the second field ($2) are used as the key term, and the current record ($0) is used as the value.

The asorti() function

In addition to arrays, awk has several basic functions that you can use as quick and easy solutions for common tasks. One of the functions introduced in GNU awk, asorti(), provides the ability to sort an array by key (or index) or value.

You can only sort the array once it has been populated, meaning that this action must not occur with every new record but only in the final stage of your script. For this purpose, awk provides the special END keyword. The inverse of BEGIN, an END statement happens only once and only after all records have been scanned.
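As a rough sketch of where this is heading (not necessarily the article's final script), the block below fills ARRAY as described and then sorts its keys in an END block with asorti(). It assumes GNU awk is installed under the name gawk; the names sort_by_field2, SORTED, and n are illustrative, and the two input records are taken from the sample set.

```shell
# Sketch only: requires GNU awk, because asorti() is a gawk extension.
# sort_by_field2, SORTED, and n are hypothetical names for this example.
sort_by_field2() {
    gawk -F';' '
    { ARRAY[$2] = $0 }               # key: second field, value: whole record
    END {
        n = asorti(ARRAY, SORTED)    # SORTED[1..n] holds the keys in sorted order
        for (i = 1; i <= n; i++)
            print ARRAY[SORTED[i]]
    }' "$@"
}

printf 'Pygoscelis;papua;Wagler;1832;Gentoo\nEudyptula;minor;Bonaparte;1867;Little Blue\n' | sort_by_field2
```

Passing a second array to asorti() leaves ARRAY itself untouched, so the sorted keys in SORTED can be used to look the original records back up. Because "minor" sorts before "papua", the Eudyptula record is printed first.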
