Question: Reliably checking wget log for errors using grep or ack

question created at Sat, Jun 1, 2019 12:00 AM

In a bash script, I have a file logfileA.txt that contains output from wget, which I'd like to grep to check for any instances of words like "error" or "fail", etc., like so:

grep -ni --color=never -e "error" -e "fail" logfileA.txt | awk -F: '{print "Line "$1": "$2}'
# grep -n line number, -i ignore case; awk to add better format to the line numbers (https://stackoverflow.com/questions/3968103)

Trouble is though, I think the wget output in logfileA.txt is full of characters that may be messing up the input for grep, as I'm not getting reliable matches.

Troubleshooting this, I cannot even cat the contents of the log file reliably. For instance, with cat logfileA.txt, all I get is the last line which is garbled:

FINISHED --2019-05-29 17:08:52--me@here:/home/n$ 71913592/3871913592]atmed out). Retrying.

The contents of logfileA.txt is:

--2019-05-29 15:26:50--  http://somesite.com/somepath/a0_FooBar/BarFile.dat
Reusing existing connection to somesite.com:80.
HTTP request sent, awaiting response... 302 Found
Location: http://cdn.somesite.com/storage/a0_FooBar/BarFile.dat [following]
--2019-05-29 15:26:50--  http://cdn.somesite.com/storage/a0_FooBar/BarFile.dat
Resolving cdn.somesite.com (cdn.somesite.com)... xxx.xxx.xx.xx
Connecting to cdn.somesite.com (cdn.somesite.com)|xxx.xxx.xx.xx|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3871913592 (3.6G) [application/octet-stream]
Saving to: ‘a0_FooBar/BarFile.dat’

a0_FooBar/BarFile.dat   0%[                    ]       0  --.-KB/s               
a0_FooBar/BarFile.dat   0%[                    ]  15.47K  70.5KB/s               
...
a0_FooBar/BarFile.dat  49%[========>           ]   1.80G  --.-KB/s    in 50m 32s 

2019-05-29 16:17:23 (622 KB/s) - Read error at byte 1931163840/3871913592 (Connection timed out). Retrying.

--2019-05-29 16:17:24--  (try: 2)  http://cdn.somesite.com/storage/a0_FooBar/BarFile.dat
Connecting to cdn.somesite.com (cdn.somesite.com)|xxx.xxx.xx.xx|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 3871913592 (3.6G), 1940749752 (1.8G) remaining [application/octet-stream]
Saving to: ‘a0_FooBar/BarFile.dat’

a0_FooBar/BarFile.dat  49%[+++++++++           ]   1.80G  --.-KB/s               
...
a0_FooBar/BarFile.dat 100%[+++++++++==========>]   3.61G  1.09MB/s    in 34m 44s 

2019-05-29 16:52:09 (909 KB/s) - ‘a0_FooBar/BarFile.dat’ saved [3871913592/3871913592]

FINISHED --2019-05-29 17:08:52--

I assume the problem could be the /s or ---s or >s or ==>s or |s?

But since the output from wget could vary, how do I anticipate and escape anything problematic for grep?

Command:

grep -ni --color=never -e "error" -e "fail" logfileA.txt | awk -F: '{print "Line "$1": "$2}'

Expected output:

Line 17: 2019-05-29 16:17:23 (622 KB/s) - Read error at byte 1931163840/3871913592 (Connection timed out). Retrying.

Also, would an ack line be better at this job? And if so, what/how?

1 Answer

Wrt I assume the problem could be the /s or ---s or >s or ==>s or |s? - no, there's nothing special about any of those characters/strings. It sounds like you might have DOS line endings (\r\n); see Why does my tool output overwrite itself and how do I fix it?.

Since you said with cat logfileA.txt, all I get is the last line which is garbled, I wonder if you ONLY have \rs and no \ns as line endings. If so, then tr '\r' '\n' < logfileA.txt > tmp && mv tmp logfileA.txt would fix that. If that IS the issue, then going forward you can use awk -v RS='\r' 'script' to change the record separator from its default \n to \r, and then you won't need that tr step.
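To see that failure mode concretely, here's a minimal sketch using a made-up log line (not your actual file) showing why cat garbles a \r-only file and how the tr fix restores it:

```shell
# Build a hypothetical log whose "lines" are separated only by \r (no \n).
printf 'line one\rRead error at byte 100\rline three\r' > logfileA.txt

# cat appears to show a single garbled line: each \r returns the cursor to
# column 1, so later text overwrites earlier text on screen.
# od -c (or cat -v) reveals the real separators:
od -c logfileA.txt | head -n 2

# Convert \r to \n so line-oriented tools see three lines:
tr '\r' '\n' < logfileA.txt > tmp && mv tmp logfileA.txt

grep -ni "error" logfileA.txt
# prints: 2:Read error at byte 100
```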

You don't need grep when you're using awk though. This:

grep -ni --color=never -e "error" -e "fail" logfileA.txt |
    awk -F: '{print "Line "$1": "$2}'

can be written as just:

awk 'tolower($0) ~ /error|fail/{print "Line "NR":"$0}' logfileA.txt

but the awk-only version is more robust as it'll correctly display full lines that contain :s where the grep+awk version will truncate them to the first :.
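To see the truncation concretely, here's a small sketch with a hypothetical log line that contains colons (timestamps in real wget output would trigger the same thing):

```shell
# Hypothetical sample line containing colons.
printf 'HTTP error at 15:26:50\n' > demo.txt

# grep+awk: awk splits on ':' so $2 stops at the first colon in the line.
grep -ni "error" demo.txt | awk -F: '{print "Line "$1": "$2}'
# prints: Line 1: HTTP error at 15

# awk-only: $0 is the whole line, so nothing is lost.
awk 'tolower($0) ~ /error/{print "Line "NR":"$0}' demo.txt
# prints: Line 1:HTTP error at 15:26:50
```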

You can handle the DOS line endings, if any, by tweaking the script to:

awk 'tolower($0) ~ /error|fail/{sub(/\r$/,""); print "Line "NR":"$0}' logfileA.txt

and you can make it look for error or fail as standalone words (as opposed to part of other strings like terror or failles) by doing this with GNU awk:

awk -v IGNORECASE=1 -v RS='\r?\n' '/\<(error|fail)\>/{print "Line "NR":"$0}' logfileA.txt

or this with any awk:

awk 'tolower($0) ~ /(^|[^[:alnum:]_])(error|fail)([^[:alnum:]_]|$)/{sub(/\r$/,""); print "Line "NR":"$0}' logfileA.txt
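A quick check of the word-boundary behaviour, on hypothetical sample lines:

```shell
# Hypothetical test lines: only standalone "error"/"fail" should match.
printf 'a terror movie\nRead error here\nfailles\nit failed\n' > demo.txt

awk 'tolower($0) ~ /(^|[^[:alnum:]_])(error|fail)([^[:alnum:]_]|$)/{print "Line "NR":"$0}' demo.txt
# only line 2 matches: Line 2:Read error here
```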
2019-06-01 07:12:29Z
  1. How can awk -v IGNORECASE=1 -v RS='\r?\n' '/\<(error|fail)\>/{print "Line "NR":"$0}' logfileA.txt be modified to catch "errors" and "failed" but not "terror" and "failles"?
    2019-06-01 07:11:36Z
  2. Just change the regexp to /\<(errors?|fail(ed)?)\>/
    2019-06-01 07:13:20Z
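Folding that suggestion into the portable (any-awk) boundary form, a sketch on made-up lines — "errors" and "failed" now match while "terror" and "failles" still don't:

```shell
printf 'two errors found\ndownload failed\na terror movie\nfailles\n' > demo.txt

# Same boundary trick as the any-awk version, with the plural and
# past-tense alternatives added to the group.
awk 'tolower($0) ~ /(^|[^[:alnum:]_])(errors?|fail(ed)?)([^[:alnum:]_]|$)/{print "Line "NR":"$0}' demo.txt
# matches lines 1 and 2 only
```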