Data Processing at the Edge with Linux awk
Last June, data scientist and visualization expert Nick Strayer learned a valuable lesson in large-scale data processing: Sometimes even the latest “Big Data”-oriented software doesn’t work as well as what we already have in the Unix toolbox. Looking to parse 25TB of genetic data, he tried using tools such as Parquet and Spark, but in the end, he found the best solution was a combination of the R statistical programming language and the humble awk.
Sometimes we need to take large data sets and hack them into something easily analyzed by a human. Other times, that same data might need to be converted from one format to another, such as when you move from using one application to a new one.
awk is an awesome Linux command-line program for performing those types of tasks. It typically takes plain text as input and produces specifically formatted output. Of course, you could do the same with common programming languages like Python or C, and if you like to develop lots of custom code, that’s one way to do it. But Linux already ships with awk, a programmable built-in utility, so why not use it?
Today we’ll begin exploring awk with a few basic examples. Future articles will cover more advanced topics.
Start Simple
Say we want to print a big listing of files on our system and just show the file names. As always, Linux offers multiple solutions for this task. Use the standard “ls” command with the “-1” option (that’s the number one).
rob% ls -1
Simple, right?
The same could be done with awk, although we’ll need to do a little extra work. Again, get a list of the files, this time using the “ls” command with the “-l” option (the letter l) and redirect the output to a file.
rob% ls -l > rob2.txt
To check the result, I used the “head” command with the “-n” option to display the first dozen lines of the rob2.txt file. The first line of the listing shows the number of 1k blocks used by the files in the directory. For our purposes, we can just delete it to clean up the file. Keep in mind that most real-world data conversions or transformations need a small bit of manual intervention before everything can be automated. It’s just the nature of slicing and dicing data. I removed the line using the vi editor and re-saved the file for later use.
Notice that the long listing is much more complicated than the “ls -1” output.
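The manual cleanup step could also be done by awk itself. The following is a minimal sketch, assuming a small made-up listing in place of the real rob2.txt (the file contents and the variable names listing and cleaned are hypothetical). It relies on NR, awk's built-in line counter.

```shell
# Hypothetical sample standing in for the saved "ls -l" output.
listing=$(mktemp)
cat > "$listing" <<'EOF'
total 16
-rw-r--r-- 1 rob rob 4096 Mar  3  2015 notes.txt
-rw-r--r-- 1 rob rob  812 Jan  9  2017 todo.txt
EOF

# NR holds the current line number. A pattern with no action prints
# the line, so "NR > 1" prints everything except the first line.
cleaned=$(awk 'NR > 1' "$listing")
printf '%s\n' "$cleaned"

rm -f "$listing"
```

On a live system the same filter could sit directly in the pipeline, for example ls -l | awk 'NR > 1' > rob2.txt, which avoids the editor step.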
Start by running the file through awk, using its standard print syntax. This outputs each line in the rob2.txt file, much like the normal Linux “cat” command.
rob% awk '{print}' rob2.txt
We only want the file names, so we should name a field in the print statement. The file names are in field nine. By default, awk splits each line into fields on runs of spaces or tabs, which suits this listing. Fields are referenced with the “$” sign followed by the field number.
rob% awk '{print $9}' rob2.txt
That’s better. It looks just like the “ls -1” command output.
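One limitation is worth seeing in action: because awk splits on whitespace, a file name that contains a space is spread across several fields. The sketch below uses a made-up two-line listing (the file names and the variable names are assumptions for illustration) to show what field nine holds in that case.

```shell
# Hypothetical listing; the second file name contains a space.
listing=$(mktemp)
cat > "$listing" <<'EOF'
-rw-r--r-- 1 rob rob 4096 Mar  3  2015 notes.txt
-rw-r--r-- 1 rob rob  812 Jan  9  2017 my report.txt
EOF

# Field nine is only the first word of the name on the second line.
names=$(awk '{print $9}' "$listing")
printf '%s\n' "$names"
# prints:
#   notes.txt
#   my

rm -f "$listing"
```

For directories where names may contain spaces, the plain “ls -1” output is the safer source of file names.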
A slightly more complex example is to print the file names, followed by their dates. Note that “ls -l” reports the last modification date, not the creation date.
rob% awk '{print $9,$6,$7,$8}' rob2.txt
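Notice that we used other fields and that they can be placed in any order you like. We could easily have put the date in front of the file name if we wanted. Field six is the month and field seven is the day. Field eight is the year for older files; for files modified recently (roughly within the last six months), “ls -l” shows the time of day in that column instead.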
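As a quick illustration of reordering, here is a sketch that puts the date first. It also shows what the commas in the print statement do: they insert awk's output separator (a single space by default), while fields written without commas are glued together. The sample line and the variable names are made up for the example.

```shell
# One hypothetical line of "ls -l" output.
line='-rw-r--r-- 1 rob rob 4096 Mar  3  2015 notes.txt'

# Commas separate the printed fields with a single space.
date_first=$(printf '%s\n' "$line" | awk '{print $6,$7,$8,$9}')
printf '%s\n' "$date_first"    # prints: Mar 3 2015 notes.txt

# Without commas the fields are concatenated.
glued=$(printf '%s\n' "$line" | awk '{print $6 $7 $8 $9}')
printf '%s\n' "$glued"         # prints: Mar32015notes.txt
```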
Sometimes files use commas or other characters for their field separator. Spreadsheets generate a comma separator when you export a .csv file (Comma Separated Values) from MS Excel or LibreOffice Calc. Use the “-F” option to specify the desired field separator in awk. Here’s an example of the command line you’d use for a comma.
rob% awk -F',' '{print $9,$6,$7,$8}' filename
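To make the separator option concrete, here is a small sketch against a made-up .csv file (the column layout, the contents and the variable names are assumptions, not an export from a real spreadsheet). It skips the header row and prints the first and fourth columns.

```shell
# Hypothetical CSV export with a header row.
csv=$(mktemp)
cat > "$csv" <<'EOF'
name,month,day,year
notes.txt,Mar,3,2015
todo.txt,Jan,9,2017
EOF

# -F',' splits on commas; NR > 1 skips the header line.
name_year=$(awk -F',' 'NR > 1 {print $1,$4}' "$csv")
printf '%s\n' "$name_year"
# prints:
#   notes.txt 2015
#   todo.txt 2017

rm -f "$csv"
```

Keep in mind that a plain comma separator does not understand quoted values that themselves contain commas, so this approach suits simple exports only.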
You can also insert text into the printout. Adding a “Date =” label might be useful.
rob% awk '{print $9,"Date =",$6,$7,$8}' rob2.txt
You Can Search, Too
awk has built-in search capabilities. Suppose we want to print out only the lines that contain “2015”. We could use the following.
rob% awk '/2015/ {print $9,"Date =",$6,$7,$8}' rob2.txt
I verified the output with a quick grep for “2015” in the file.
rob% grep 2015 rob2.txt
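One caveat with this kind of search: a bare /2015/ pattern is tested against the entire line, so it also matches a “2015” that appears in a file size or a file name. The sketch below uses a made-up listing (contents and variable names are hypothetical) in which one file from 2016 happens to be 20150 bytes long, and contrasts the whole-line match with a match limited to field eight using awk's “~” operator.

```shell
# Hypothetical listing; budget.csv is dated 2016 but is 20150 bytes.
listing=$(mktemp)
cat > "$listing" <<'EOF'
-rw-r--r-- 1 rob rob  4096 Mar  3  2015 notes.txt
-rw-r--r-- 1 rob rob 20150 Jun 12  2016 budget.csv
-rw-r--r-- 1 rob rob   812 Jan  9  2017 todo.txt
EOF

# Whole-line match: also catches the size column of budget.csv.
whole_line=$(awk '/2015/ {print $9}' "$listing")
printf '%s\n' "$whole_line"
# prints:
#   notes.txt
#   budget.csv

# Match restricted to field eight only.
field_only=$(awk '$8 ~ /2015/ {print $9}' "$listing")
printf '%s\n' "$field_only"
# prints:
#   notes.txt

rm -f "$listing"
```

The same caution applies to the grep check, since grep also searches the whole line.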
Another way to search is by comparing a field to a value. We can compare field eight (the year) to “2015”.
rob% awk '{if ($8==2015) print $9,"Date =",$6,$7,$8}' rob2.txt
Maybe you’d want to search for years greater than “2015.” Use a comparison there too.
rob% awk '{if ($8>2015) print $9,"Date =",$6,$7,$8}' rob2.txt
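The comparison can also be written as a pattern in front of the action, which is a common awk idiom and behaves the same as the if form. The sketch below uses a made-up listing (contents and variable names are hypothetical) that includes one recent file whose eighth field is a time rather than a year. Because a value like 21:05 is not a number, awk compares it to 2015 as text, and it can slip through the filter. The extra check, which requires field eight to be all digits, is an assumption about how you might guard against that.

```shell
# Hypothetical listing; fresh.log shows a time in field eight.
listing=$(mktemp)
cat > "$listing" <<'EOF'
-rw-r--r-- 1 rob rob  4096 Mar  3  2015 notes.txt
-rw-r--r-- 1 rob rob 20150 Jun 12  2016 budget.csv
-rw-r--r-- 1 rob rob   812 Jan  9  2017 todo.txt
-rw-r--r-- 1 rob rob   300 Feb 14 21:05 fresh.log
EOF

# Pattern form of the comparison; "21:05" is compared as a string
# and sorts after "2015", so fresh.log is printed as well.
naive=$(awk '$8 > 2015 {print $9}' "$listing")
printf '%s\n' "$naive"

# Only treat field eight as a year when it consists of digits.
guarded=$(awk '$8 ~ /^[0-9]+$/ && $8 > 2015 {print $9}' "$listing")
printf '%s\n' "$guarded"
# prints:
#   budget.csv
#   todo.txt

rm -f "$listing"
```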
One More Thing
I mentioned at the beginning of the article that awk is great for data conversion or translation.
Suppose we want to change the year from 2015 to 2016, when it occurs in field 8 (the year). It is as easy as replacing the “$8” field, in the print part, with “2016”.
rob% awk '{if ($8==2015) print $9,"Date =",$6,$7,"2016"}' rob2.txt
Although this is a trivial example, in principle it could be used in quite a few practical situations.
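A variation on the same idea is to assign a new value to the field and then print every line, so that lines from other years pass through untouched instead of being dropped. This is a sketch on a made-up listing (contents and variable names are hypothetical). Note that when a field is assigned, awk rebuilds that line with single spaces between fields, so the original column alignment is lost on changed lines.

```shell
# Hypothetical two-line listing.
listing=$(mktemp)
cat > "$listing" <<'EOF'
-rw-r--r-- 1 rob rob 4096 Mar  3  2015 notes.txt
-rw-r--r-- 1 rob rob  812 Jan  9  2017 todo.txt
EOF

# First rule rewrites field eight on 2015 lines.
# Second rule (no pattern) prints every line, changed or not.
updated=$(awk '$8 == 2015 {$8 = 2016} {print}' "$listing")
printf '%s\n' "$updated"
# prints:
#   -rw-r--r-- 1 rob rob 4096 Mar 3 2016 notes.txt
#   -rw-r--r-- 1 rob rob  812 Jan  9  2017 todo.txt

rm -f "$listing"
```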
Going Further
awk has a lot of options and it can handle seriously large files. I usually use quick one-liners and send the results, on the fly, to my terminal, or save them to a new file using standard Linux redirection (the > character). awk also has scripting capabilities, and those can get quite complex. We can investigate those details in a future story.
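As a small taste of both ideas, the sketch below stores a one-rule awk program in a file, runs it with the “-f” option and redirects the result to a new file. Everything here is made up for illustration: the listing contents, the script name names.awk, the output name old_files.txt and the variable names.

```shell
# Work in a throwaway directory with hypothetical files.
workdir=$(mktemp -d)

cat > "$workdir/rob2.txt" <<'EOF'
-rw-r--r-- 1 rob rob 4096 Mar  3  2015 notes.txt
drwxr-xr-x 2 rob rob 4096 Nov 21  2014 archive
-rw-r--r-- 1 rob rob  812 Jan  9  2017 todo.txt
EOF

# The awk program lives in its own file instead of on the command line.
cat > "$workdir/names.awk" <<'EOF'
# Print name and year for entries dated 2015 or earlier.
$8 <= 2015 { print $9, $8 }
EOF

# -f reads the program from the file; > saves the output.
awk -f "$workdir/names.awk" "$workdir/rob2.txt" > "$workdir/old_files.txt"

old_files=$(cat "$workdir/old_files.txt")
printf '%s\n' "$old_files"
# prints:
#   notes.txt 2015
#   archive 2014

rm -rf "$workdir"
```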
Data conversions and translations can be tedious. awk, while practically magical, does have a learning curve. With a little bit of practice, though, it is certainly better than going through a data file manually.
Don’t forget that awk is available nearly everywhere. You will find it on Linux servers, desktops, notebooks, Raspberry Pi boards and a variety of nano-Linux machines. That makes awk worth considering for standalone, high-powered data processing at the edge.
TNS Managing Editor Joab Jackson contributed to this post.
Contact Rob “drtorq” Reilly for consultation, speaking engagements and commissioned projects at [email protected] or 407-718-3274.
Source: InApps.net