
Scraping a Web Page in Bash

Submitted on September 23, 2021 – 11:33 am

Just a scripting exercise because I need to do something important, but I am procrastinating. The idea is simple: grab some URL with text containing somewhat structured data and convert it into a spreadsheet. I know, exciting…

I will use “The World’s Most Nutritious Foods” article from Pocket and try to make a spreadsheet out of it. The first step below does a few things:

  1. Downloads the URL and dumps it to text format
  2. Removes leading spaces
  3. Grabs the title line for each food item and the seven subsequent lines
  4. Removes the number list prefix in the title line
  5. Removes any lower-case text in parentheses, i.e. “(lower-case text)”

tmpfile="$(mktemp)"
tmpdir="$(mktemp -d)"
url="https://getpocket.com/explore/item/the-world-s-most-nutritious-foods"
lynx --dump "${url}" | \
sed -r 's/^\s+//g' | \
grep -P '^[0-9]{1,3}\.\s' -A7 --group-separator=$'ooo' | \
sed -r 's/^[0-9]{1,3}\. //g' | \
sed -r 's/ \([a-z ]+\)//g' > "${tmpfile}"

Sample output
[root@ncc1711:/tmp/tmp.2oktVowKGn] # cat "${tmpfile}"
SWEET POTATO

86kcal, $0.21, per 100g

A bright orange tuber, sweet potatoes are only distantly related to
potatoes. They are rich in beta-carotene.

NUTRITIONAL SCORE: 49
ooo
FIGS

249kcal, $0.81, per 100g

Figs have been cultivated since ancient times. Eaten fresh or dried,
they are rich in the mineral manganese.

NUTRITIONAL SCORE: 49
ooo
GINGER

80kcal, $0.85, per 100g

Ginger contains high levels of antioxidants. In medicine, it is used as
a digestive stimulant and to treat colds.

NUTRITIONAL SCORE: 49
...

All these steps, of course, are specific to the text you’re parsing. There is no universal approach to a task such as this.
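One GNU grep detail worth calling out from the pipeline above: --group-separator replaces the default “--” line that grep prints between separate context groups, which is what gives us the “ooo” record delimiter. A toy example on made-up input:

```shell
# GNU grep prints the --group-separator line between context groups
# instead of the default "--"; each match plus one trailing line (-A1)
# forms a group here.
printf 'a1\nx\nb\na2\ny\nc\n' | grep -A1 --group-separator='ooo' '^[ab][0-9]'
# Output:
# a1
# x
# ooo
# a2
# y
```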

The second step splits the file on the “ooo” separator and removes the trailing split file (that contains nothing of interest).

cd "${tmpdir}" && csplit -k "${tmpfile}" '/^ooo/' "{$(grep -c 'ooo' "${tmpfile}")}" 2>/dev/null
/bin/rm -f "$(ls | sort -V | tail -1)"

Sample output
[root@ncc1711:/tmp/tmp.2oktVowKGn] # ls
xx00  xx04  xx08  xx12  xx16  xx20  xx24  xx28  xx32  xx36  xx40  xx44  xx48  xx52  xx56  xx60  xx64  xx68  xx72  xx76  xx80  xx84  xx88  xx92  xx96
xx01  xx05  xx09  xx13  xx17  xx21  xx25  xx29  xx33  xx37  xx41  xx45  xx49  xx53  xx57  xx61  xx65  xx69  xx73  xx77  xx81  xx85  xx89  xx93  xx97
xx02  xx06  xx10  xx14  xx18  xx22  xx26  xx30  xx34  xx38  xx42  xx46  xx50  xx54  xx58  xx62  xx66  xx70  xx74  xx78  xx82  xx86  xx90  xx94  xx98
xx03  xx07  xx11  xx15  xx19  xx23  xx27  xx31  xx35  xx39  xx43  xx47  xx51  xx55  xx59  xx63  xx67  xx71  xx75  xx79  xx83  xx87  xx91  xx95  xx99

[root@ncc1711:/tmp/tmp.2oktVowKGn] # cat xx04
ooo
BURDOCK ROOT

72kcal, $1.98, per 100g

Used in folk medicine and as a vegetable, studies suggest burdock can
aid fat loss and limit inflammation.

NUTRITIONAL SCORE: 50
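As an aside, if csplit is not available, the same record split can be sketched with awk. This is an alternative, not what the script above uses: the rec* file names are made up, and unlike csplit this version drops the “ooo” separator lines instead of keeping them. Shown here on a toy input standing in for "${tmpfile}":

```shell
# Write each record to rec00, rec01, ...; a line starting with
# "ooo" closes the current output file and starts the next one.
printf 'A\n1\nooo\nB\n2\nooo\nC\n3\n' |
awk 'BEGIN { n = 0; out = sprintf("rec%02d", n) }
     /^ooo/ { close(out); out = sprintf("rec%02d", ++n); next }
     { print > out }'
```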

Finally, we read each xx* file and extract the five elements: item name, nutritional score, calories, unit price, and description. The primary approach is to use grep -P (Perl-compatible) regexes with look-behind and look-ahead assertions. You could also head/tail specific lines, since their positions within each record are constant, but you would still need to parse those lines. And we prepend the column headers to the output.
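Before the full loop, here is the look-around syntax in isolation, applied to a price line in the format shown above: the assertions anchor the match without being part of it, so -o prints only the number.

```shell
# (?<=\$) asserts a literal "$" immediately before the match and
# (?=,) a comma immediately after; -o prints only the match itself
echo '86kcal, $0.21, per 100g' | grep -oP '(?<=\$)[0-9]+\.[0-9]+(?=,)'
# → 0.21
```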

ls xx* | while read f; do
item="$(grep -vi ooo "${f}" | head -1 | sed 's/.*/\L&/; s/[a-z]*/\u&/g')"
score="$(tail -1 "${f}" | awk '{print $NF}')"
calories="$(grep -oP "(?<=^)[0-9]{1,}(?=kcal)" "${f}")"
price="$(grep -oP "(?<=\$)[0-9]{1,}\.[0-9]{1,}(?=,)" "${f}")"
description="$(grep -Evi 'ooo|kcal' "${f}" | grep -oP "(?<=^).*[a-z](\.|,)?(?=$)" | paste -sd ' ' -)"
echo "\"${item}\",\"${score}\",\"${calories}\",\"${price}\",\"${description}\""
done | \
(echo "\"ITEM\",\"NUTRITIONAL SCORE\",\"CALORIES/100G\",\"USD/100G\",\"DESCRIPTION\"" && cat) > ~/nutrition.csv
cd ~ && ls -als nutrition.csv
/bin/rm -rf "${tmpdir:-x}" "${tmpfile:-x}"
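One caveat with building the CSV by hand like this: if an extracted field ever contains a double quote, it has to be doubled to stay valid CSV (RFC 4180). A small helper sketch — csv_escape is a hypothetical name, not part of the script above:

```shell
# Wrap a value in double quotes, doubling any embedded double quotes
# (RFC 4180 quoting); POSIX-safe via sed rather than bash expansion.
csv_escape() {
  printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}
csv_escape 'rich in "beta-carotene"'
# → "rich in ""beta-carotene"""
```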

The optional step is to convert the CSV to XLSX with unoconv.

unoconv -f xlsx -d spreadsheet -o ~/nutrition.xlsx ~/nutrition.csv
file ~/nutrition.xlsx

And here’s the end result.

