
Scraping a Web Page in Bash

Submitted on September 23, 2021 – 11:33 am

Just a scripting exercise because I need to do something important, but I am procrastinating. The idea is simple: grab some URL with text containing somewhat structured data and convert it into a spreadsheet. I know, exciting…

I will use “The World’s Most Nutritious Foods” article from Pocket and try to make a spreadsheet out of it. The first step below does a few things:

  1. Downloads the URL and dumps it to text format
  2. Removes leading spaces
  3. Grabs the title line for each food item and the seven subsequent lines
  4. Removes the number list prefix in the title line
  5. Removes any lower-case text in parentheses, i.e. “(lower-case text)”

tmpfile="$(mktemp)"
tmpdir="$(mktemp -d)"
url="https://getpocket.com/explore/item/the-world-s-most-nutritious-foods"
lynx --dump "${url}" | \
sed -r 's/^\s+//g' | \
grep -P '^[0-9]{1,3}\.\s' -A7 --group-separator=$'ooo' | \
sed -r 's/^[0-9]{1,3}\. //g' | \
sed -r 's/ \([a-z ]+\)//g' > "${tmpfile}"

Sample output
[root@ncc1711:/tmp/tmp.2oktVowKGn] # cat "${tmpfile}"
SWEET POTATO

86kcal, $0.21, per 100g

A bright orange tuber, sweet potatoes are only distantly related to
potatoes. They are rich in beta-carotene.

NUTRITIONAL SCORE: 49
ooo
FIGS

249kcal, $0.81, per 100g

Figs have been cultivated since ancient times. Eaten fresh or dried,
they are rich in the mineral manganese.

NUTRITIONAL SCORE: 49
ooo
GINGER

80kcal, $0.85, per 100g

Ginger contains high levels of antioxidants. In medicine, it is used as
a digestive stimulant and to treat colds.

NUTRITIONAL SCORE: 49
...

All these steps, of course, are specific to the text you’re parsing. There is no universal approach to a task such as this.
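One GNU grep detail worth calling out from the pipeline above: --group-separator replaces the default “--” line that grep prints between separate context groups, which is what gives us the “ooo” record delimiter. A toy example on made-up input:

```shell
# GNU grep prints the --group-separator line between context groups
# instead of the default "--"; each match plus one trailing line (-A1)
# forms a group here.
printf 'a1\nx\nb\na2\ny\nc\n' | grep -A1 --group-separator='ooo' '^[ab][0-9]'
# Output:
# a1
# x
# ooo
# a2
# y
```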

The second step splits the file on the “ooo” separator and removes the trailing split file (that contains nothing of interest).

cd "${tmpdir}" && csplit -k "${tmpfile}" '/^ooo/' "{$(grep -c 'ooo' "${tmpfile}")}" 2>/dev/null
/bin/rm -f "$(ls | sort -V | tail -1)"

Sample output
[root@ncc1711:/tmp/tmp.2oktVowKGn] # ls
xx00  xx04  xx08  xx12  xx16  xx20  xx24  xx28  xx32  xx36  xx40  xx44  xx48  xx52  xx56  xx60  xx64  xx68  xx72  xx76  xx80  xx84  xx88  xx92  xx96
xx01  xx05  xx09  xx13  xx17  xx21  xx25  xx29  xx33  xx37  xx41  xx45  xx49  xx53  xx57  xx61  xx65  xx69  xx73  xx77  xx81  xx85  xx89  xx93  xx97
xx02  xx06  xx10  xx14  xx18  xx22  xx26  xx30  xx34  xx38  xx42  xx46  xx50  xx54  xx58  xx62  xx66  xx70  xx74  xx78  xx82  xx86  xx90  xx94  xx98
xx03  xx07  xx11  xx15  xx19  xx23  xx27  xx31  xx35  xx39  xx43  xx47  xx51  xx55  xx59  xx63  xx67  xx71  xx75  xx79  xx83  xx87  xx91  xx95  xx99

[root@ncc1711:/tmp/tmp.2oktVowKGn] # cat xx04
ooo
BURDOCK ROOT

72kcal, $1.98, per 100g

Used in folk medicine and as a vegetable, studies suggest burdock can
aid fat loss and limit inflammation.

NUTRITIONAL SCORE: 50
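As an aside, if csplit is not available, the same record split can be sketched with awk. This is an alternative, not what the script above uses: the rec* file names are made up, and unlike csplit this version drops the “ooo” separator lines instead of keeping them. Shown here on a toy input standing in for "${tmpfile}":

```shell
# Write each record to rec00, rec01, ...; a line starting with
# "ooo" closes the current output file and starts the next one.
printf 'A\n1\nooo\nB\n2\nooo\nC\n3\n' |
awk 'BEGIN { n = 0; out = sprintf("rec%02d", n) }
     /^ooo/ { close(out); out = sprintf("rec%02d", ++n); next }
     { print > out }'
```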

Finally, we read each xx* file and extract the five elements: item name, nutritional score, calories, unit price, and description. The primary approach is to use grep -P (Perl-compatible) regexes with look-behind and look-ahead assertions. You could also head/tail specific lines, since their positions within each record are constant, but you would still need to parse those lines. And we prepend the column headers to the output.
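Before the full loop, here is the look-around syntax in isolation, applied to a price line in the format shown above: the assertions anchor the match without being part of it, so -o prints only the number.

```shell
# (?<=\$) asserts a literal "$" immediately before the match and
# (?=,) a comma immediately after; -o prints only the match itself
echo '86kcal, $0.21, per 100g' | grep -oP '(?<=\$)[0-9]+\.[0-9]+(?=,)'
# → 0.21
```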

ls xx* | while read f; do
item="$(grep -vi ooo "${f}" | head -1 | sed 's/.*/\L&/; s/[a-z]*/\u&/g')"
score="$(tail -1 "${f}" | awk '{print $NF}')"
calories="$(grep -oP "(?<=^)[0-9]{1,}(?=kcal)" "${f}")"
price="$(grep -oP "(?<=\$)[0-9]{1,}\.[0-9]{1,}(?=,)" "${f}")"
description="$(grep -Evi 'ooo|kcal' "${f}" | grep -oP "(?<=^).*[a-z](\.|,)?(?=$)" | paste -sd ' ' -)"
echo "\"${item}\",\"${score}\",\"${calories}\",\"${price}\",\"${description}\""
done | \
(echo "\"ITEM\",\"NUTRITIONAL SCORE\",\"CALORIES/100G\",\"USD/100G\",\"DESCRIPTION\"" && cat) > ~/nutrition.csv
cd ~ && ls -als nutrition.csv
/bin/rm -rf "${tmpdir:-x}" "${tmpfile:-x}"
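One caveat with building the CSV by hand like this: if an extracted field ever contains a double quote, it has to be doubled to stay valid CSV (RFC 4180). A small helper sketch — csv_escape is a hypothetical name, not part of the script above:

```shell
# Wrap a value in double quotes, doubling any embedded double quotes
# (RFC 4180 quoting); POSIX-safe via sed rather than bash expansion.
csv_escape() {
  printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}
csv_escape 'rich in "beta-carotene"'
# → "rich in ""beta-carotene"""
```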

The optional step is to convert the CSV to XLSX with unoconv.

unoconv -f xlsx -d spreadsheet -o ~/nutrition.xlsx ~/nutrition.csv
file ~/nutrition.xlsx

And here’s the end result.

