Just a scripting exercise, because I need to do something important and I am procrastinating. The idea is simple: grab some URL containing somewhat structured text and convert it into a spreadsheet. I know, exciting…
I will use “The World’s Most Nutritious Foods” article from Pocket and try to make a spreadsheet out of it. The first step below does a few things:
- Downloads the URL and dumps it to text format
- Removes leading spaces
- Grabs the title line for each food item and the seven subsequent lines
- Removes the number list prefix in the title line
- Removes any lower-case text in parentheses, i.e. “(lower-case text)”
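The “ooo” marker used later as a record separator comes from grep’s -A/--group-separator options. A quick demo on fabricated input (not the article data) shows how matched groups get the separator between them:

```shell
# Each numbered line plus one following line (-A1) forms a group;
# non-adjacent groups are joined with the "ooo" separator line.
printf '1. alpha\nx\ny\n2. beta\nz\nw\n' | \
  grep -P '^[0-9]{1,3}\.\s' -A1 --group-separator='ooo'
# 1. alpha
# x
# ooo
# 2. beta
# z
```

Note that the separator is only printed between groups, never after the last one.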
tmpfile="$(mktemp)"
tmpdir="$(mktemp -d)"
url="https://getpocket.com/explore/item/the-world-s-most-nutritious-foods"
lynx --dump "${url}" | \
sed -r 's/^\s+//g' | \
grep -P '^[0-9]{1,3}\.\s' -A7 --group-separator=$'ooo' | \
sed -r 's/^[0-9]{1,3}\. //g' | \
sed -r 's/ \([a-z ]+\)//g' > "${tmpfile}"
Sample output
[root@ncc1711:/tmp/tmp.2oktVowKGn] # cat "${tmpfile}"
SWEET POTATO
86kcal, $0.21, per 100g
A bright orange tuber, sweet potatoes are only distantly related to
potatoes. They are rich in beta-carotene.
NUTRITIONAL SCORE: 49
ooo
FIGS
249kcal, $0.81, per 100g
Figs have been cultivated since ancient times. Eaten fresh or dried,
they are rich in the mineral manganese.
NUTRITIONAL SCORE: 49
ooo
GINGER
80kcal, $0.85, per 100g
Ginger contains high levels of antioxidants. In medicine, it is used as
a digestive stimulant and to treat colds.
NUTRITIONAL SCORE: 49
...
All these steps, of course, are specific to the text you’re parsing. There is no universal approach to a task such as this.
The second step splits the file on the “ooo” separator and removes the trailing split file (that contains nothing of interest).
cd "${tmpdir}" && csplit -k "${tmpfile}" '/^ooo/' "{$(grep -c 'ooo' "${tmpfile}")}" 2>/dev/null
/bin/rm -f "$(ls | sort -V | tail -1)"
Sample output
[root@ncc1711:/tmp/tmp.2oktVowKGn] # ls
xx00 xx04 xx08 xx12 xx16 xx20 xx24 xx28 xx32 xx36 xx40 xx44 xx48 xx52 xx56 xx60 xx64 xx68 xx72 xx76 xx80 xx84 xx88 xx92 xx96
xx01 xx05 xx09 xx13 xx17 xx21 xx25 xx29 xx33 xx37 xx41 xx45 xx49 xx53 xx57 xx61 xx65 xx69 xx73 xx77 xx81 xx85 xx89 xx93 xx97
xx02 xx06 xx10 xx14 xx18 xx22 xx26 xx30 xx34 xx38 xx42 xx46 xx50 xx54 xx58 xx62 xx66 xx70 xx74 xx78 xx82 xx86 xx90 xx94 xx98
xx03 xx07 xx11 xx15 xx19 xx23 xx27 xx31 xx35 xx39 xx43 xx47 xx51 xx55 xx59 xx63 xx67 xx71 xx75 xx79 xx83 xx87 xx91 xx95 xx99
[root@ncc1711:/tmp/tmp.2oktVowKGn] # cat xx04
ooo
BURDOCK ROOT
72kcal, $1.98, per 100g
Used in folk medicine and as a vegetable, studies suggest burdock can
aid fat loss and limit inflammation.
NUTRITIONAL SCORE: 50
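As the sample shows, every chunk except the first begins with the separator line itself, which is why the next step filters “ooo” back out. A toy run of the same csplit invocation on fabricated input (not the article data) shows the mechanics:

```shell
# Toy input standing in for ${tmpfile}: three one-line "items"
# separated by the ooo marker.
workdir="$(mktemp -d)" && cd "${workdir}"
printf 'one\nooo\ntwo\nooo\nthree\n' > sample.txt
# -k keeps the split files even though the final pattern repetition
# finds no match; byte counts and errors are discarded as in the original.
csplit -k sample.txt '/^ooo/' "{$(grep -c 'ooo' sample.txt)}" >/dev/null 2>&1 || true
cat xx00    # lines before the first separator: one
cat xx01    # each later chunk starts with the separator: ooo, then two
```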
Finally, we read each xx* file and extract the five elements: item name, nutritional score, calories, unit price, and description. The primary approach is Perl-compatible regex (grep -P) with look-behind and look-ahead assertions. You could also head/tail specific line numbers, since each field’s position within a chunk is constant, but you would still need to parse those lines. We also prepend the column headers to the output.
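As an aside, the head/tail variant mentioned above could look like the following sketch, run here against a fabricated chunk shaped like the xx* files (the line positions are taken from the sample output above):

```shell
# Fabricated chunk in the shape of an xx* split file (sample data):
chunk="$(mktemp)"
cat > "${chunk}" <<'EOF'
ooo
FIGS
249kcal, $0.81, per 100g
Figs have been cultivated since ancient times. Eaten fresh or dried,
they are rich in the mineral manganese.
NUTRITIONAL SCORE: 49
EOF

item="$(sed -n '2p' "${chunk}")"                   # line 2 is always the name
stats="$(sed -n '3p' "${chunk}")"                  # line 3 holds calories/price
calories="${stats%%kcal*}"                         # strip everything from "kcal" on
score="$(tail -1 "${chunk}" | awk '{print $NF}')"  # last field of the last line
echo "${item} ${calories} ${score}"                # FIGS 249 49
rm -f "${chunk}"
```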
for f in xx*; do
  # Item name: first non-separator line, converted to Title Case
  item="$(grep -vi ooo "${f}" | head -1 | sed 's/.*/\L&/; s/[a-z]*/\u&/g')"
  # Nutritional score: last field of the last line
  score="$(tail -1 "${f}" | awk '{print $NF}')"
  # Calories: leading digits of the "NNNkcal, $X.XX, per 100g" line
  calories="$(grep -oP '^[0-9]+(?=kcal)' "${f}")"
  # Price: the digits between "$" and ","
  price="$(grep -oP '(?<=\$)[0-9]+\.[0-9]+(?=,)' "${f}")"
  # Description: the remaining lines, joined with spaces
  description="$(grep -Evi 'ooo|kcal' "${f}" | grep -oP '.*[a-z](\.|,)?$' | tr '\n' ' ' | sed -r 's/\s+$//')"
  echo "\"${item}\",\"${score}\",\"${calories}\",\"${price}\",\"${description}\""
done | \
(echo "\"ITEM\",\"NUTRITIONAL SCORE\",\"CALORIES/100G\",\"USD/100G\",\"DESCRIPTION\"" && cat) > ~/nutrition.csv
cd ~ && ls -als nutrition.csv
/bin/rm -rf "${tmpdir:-x}" "${tmpfile:-x}"
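The (echo … && cat) subshell at the end is a small idiom for prepending a header to streamed output, shown here on toy rows:

```shell
# The subshell writes the header line first, then cat copies stdin after it:
printf 'a,1\nb,2\n' | (echo 'name,value' && cat)
# name,value
# a,1
# b,2
```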
An optional final step is to convert the CSV to XLSX with unoconv.
unoconv -f xlsx -d spreadsheet -o ~/nutrition.xlsx ~/nutrition.csv
file ~/nutrition.xlsx
And here’s the end result.