CLI Web Spider

Here is a short manual that shows three ways to create a simple web spider by using the CLI on Linux/Ubuntu. The source of this article is my question How do I create a CLI Web Spider that uses keywords and filters content? at Ask Ubuntu. Before asking the question, I had read a few related topics, but none of the solutions fit my needs.

The task: How to create a simple Web Spider in the Ubuntu CLI

I want to find my articles within the deprecated (obsolete) literature forum e-bane.net. Some of the forum modules are disabled, and I can't get a list of articles by their author. Also, the site is not indexed by search engines such as Google, Yandex, etc.

The only way to find all of my articles is to open the archive page of the site (Figure 1). Then I must select a certain year and month – e.g. January 2013 (Figure 1). And then I must inspect each article (Figure 2) to see whether my nickname – pa4080 – is written at the beginning (Figure 3). But there are a few thousand articles.

Solution 1: Bash script that uses Wget

To solve this task I've created the following simple bash script, which mainly uses the CLI tool wget.

#!/bin/bash
# Name: web-spider.wget.sh

TARGET_URL='http://e-bane.net/modules.php?name=Stories_Archive'
KEY_WORDS=('pa4080' 's0ther')
MAP_FILE='url.map'
OUT_FILE='url.list'

get_url_map() {     # Use 'wget' as spider and output the result into a file (and stdout)
    wget --spider --force-html -r -l2 "${TARGET_URL}" 2>&1 | grep '^--' | awk '{ print $3 }' | tee -a "$MAP_FILE"
}

filter_url_map() {  # Apply some filters to the $MAP_FILE and keep only the URLs, that contain 'article&sid'
    uniq "$MAP_FILE" | grep -v '\.\(css\|js\|png\|gif\|jpg\|txt\)$' | grep 'article&sid' | sort -u > "${MAP_FILE}.uniq && mv "${MAP_FILE}.uniq" "$MAP_FILE"
    printf '\n# -----\nThe number of the pages to be scanned: %s\n' "$(wc -l < "$MAP_FILE")"
}

get_key_urls() {
    counter=1
    while IFS= read -r URL; do                 # Do this for each line in the $MAP_FILE
        for KEY_WORD in "${KEY_WORDS[@]}"; do  # For each $KEY_WORD in $KEY_WORDS
            if [[ ! -z "$(wget -qO- "${URL}" | grep -io "${KEY_WORD}" | head -n1)" ]]; then  # Check if the $KEY_WORD exists within the content of the page,
                echo "${URL}" | tee -a "$OUT_FILE"                                           # if it is true echo the particular $URL into the $OUT_FILE
                printf '%s\t%s\n' "${KEY_WORD}" "YES"
            fi
        done
        printf 'Progress: %s\r' "$counter"; ((counter++))
    done < "$MAP_FILE"
}

# Call the functions
get_url_map
filter_url_map
get_key_urls
Figure 4. The spider script during the working process.

The script has three functions:

  • The first function get_url_map() uses wget as --spider (which means that it will just check that the pages are there) and will recursively (-r) build the $MAP_FILE of URLs found under the $TARGET_URL, with depth level -l2. (Another example could be found here: Convert Website to PDF.) In the current case the $MAP_FILE contains about 20 000 URLs. A sketch of the output format this step relies on is shown right after this list.
  • The second function filter_url_map() removes duplicates and links to static files (css, js, images, etc.) from the $MAP_FILE, keeps only the URLs that contain 'article&sid', and prints how many pages remain to be scanned.
  • The third function get_key_urls() uses wget -qO- (similar to the command curl – examples) to output the content of each URL from the $MAP_FILE and tries to find any of the $KEY_WORDS within it. If any of the $KEY_WORDS is found within the content of a particular URL, that URL is saved in the $OUT_FILE.
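A minimal sketch of what get_url_map() relies on: in spider mode wget prints a request line for every URL it checks, starting with '--', followed by a timestamp and the URL as the third field, which is exactly what the grep '^--' | awk '{ print $3 }' pair extracts. The sample line below is illustrative; real output differs only in the timestamp and URL.

# Illustrative request line in the format printed by 'wget --spider':
sample='--2018-01-09 13:09:44--  http://e-bane.net/modules.php?name=News&file=article&sid=5644'
echo "$sample" | grep '^--' | awk '{ print $3 }'
# Prints: http://e-bane.net/modules.php?name=News&file=article&sid=5644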

During the working process the output of the script looks as shown in Figure 4. It takes about 63 minutes to finish.
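To reproduce the run, save the script under the name from its header comment and execute it directly; here is a minimal sketch (the work files url.map and url.list are created in the current directory):

chmod +x web-spider.wget.sh   # make the script executable
./web-spider.wget.sh          # builds url.map, filters it, then writes the matching URLs to url.list
wc -l url.list                # how many articles contain any of the keywords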

Solution 2: Python 3 script

This solution is provided by @dan as an answer to my question at Ask Ubuntu.

#!/usr/bin/python3
# Name: web-spider.py

from urllib.parse import urljoin
import json

import bs4
import click
import aiohttp
import asyncio
import async_timeout


BASE_URL = 'http://e-bane.net'


async def fetch(session, url):
    try:
        with async_timeout.timeout(20):
            async with session.get(url) as response:
                return await response.text()
    except asyncio.TimeoutError as e:
        print('[{}]{}'.format('timeout error', url))
        with async_timeout.timeout(20):
            async with session.get(url) as response:
                return await response.text()


async def get_result(user):
    target_url = 'http://e-bane.net/modules.php?name=Stories_Archive'
    res = []
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, target_url)
        html_soup = bs4.BeautifulSoup(html, 'html.parser')
        date_module_links = parse_date_module_links(html_soup)
        for dm_link in date_module_links:
            html = await fetch(session, dm_link)
            html_soup = bs4.BeautifulSoup(html, 'html.parser')
            thread_links = parse_thread_links(html_soup)
            print('[{}]{}'.format(len(thread_links), dm_link))
            for t_link in thread_links:
                thread_html = await fetch(session, t_link)
                t_html_soup = bs4.BeautifulSoup(thread_html, 'html.parser')
                if is_article_match(t_html_soup, user):
                    print('[v]{}'.format(t_link))
                    # to get main article, uncomment below code
                    # res.append(get_main_article(t_html_soup))
                    # code below is used to get thread link
                    res.append(t_link)
                else:
                    print('[x]{}'.format(t_link))

        return res


def parse_date_module_links(page):
    a_tags = page.select('ul li a')
    hrefs = [x.get('href') for x in a_tags]
    return [urljoin(BASE_URL, x) for x in hrefs]


def parse_thread_links(page):
    a_tags = page.select('table table  tr  td > a')
    hrefs = [x.get('href') for x in a_tags]
    # filter href with 'file=article'
    valid_hrefs = [x for x in hrefs if 'file=article' in x]
    return [urljoin(BASE_URL, x) for x in valid_hrefs]


def is_article_match(page, user):
    main_article = get_main_article(page)
    return main_article.text.startswith(user)


def get_main_article(page):
    td_tags = page.select('table table td.row1')
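    # the fifth matched 'td.row1' cell (index 4) holds the main article body on this forum's layout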
    td_tag = td_tags[4]
    return td_tag


@click.command()
@click.argument('user')
@click.option('--output-filename', default='out.json', help='Output filename.')
def main(user, output_filename):
    loop = asyncio.get_event_loop()
    res = loop.run_until_complete(get_result(user))
    # if you want to return main article, convert html soup into text
    # text_res = [x.text for x in res]
    # else just put res on text_res
    text_res = res
    with open(output_filename, 'w') as f:
        json.dump(text_res, f)


if __name__ == '__main__':
    main()

$ cat requirement.txt

aiohttp>=2.3.7
beautifulsoup4>=4.6.0
click>=6.7

This is the Python 3 version of the script (tested with Python 3.5 on Ubuntu 17.10).

How to use:

  • To use it, put both pieces of code in files. For example, the code file is script.py and the package file is requirement.txt.
  • Run pip install -r requirement.txt.
  • Run the script, for example: python3 script.py pa4080 (a virtual-environment variant is sketched below).
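If you prefer not to install the packages system-wide, the same steps can be done inside a virtual environment. Here is a minimal sketch, assuming the file names from the list above; the environment name spider-env and the output file pa4080.json are just examples:

python3 -m venv spider-env                  # create an isolated environment
source spider-env/bin/activate              # activate it in the current shell
pip install -r requirement.txt              # installs aiohttp, beautifulsoup4, click
python3 script.py pa4080 --output-filename pa4080.json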

It uses several libraries:

  • click for the argument parser
  • beautifulsoup4 for the HTML parser
  • aiohttp for the HTML downloader

Things to know to develop the program further (other than the docs of the required packages):

  • Python standard libraries: asyncio, json and urllib.parse
  • CSS selectors (MDN Web Docs), also some HTML; see also how to use CSS selectors in your browser, such as this article

How it works:

  • First I create a simple HTML downloader. It is a modified version of the sample given in the aiohttp documentation.
  • After that, I create a simple command line parser which accepts the username and the output filename.
  • Then I create a parser for the thread links and the main article. Using pdb and simple URL manipulation should do the job.
  • Finally, I combine the functions and put the main article into JSON, so other programs can process it later.

Some ideas so it can be developed further:

  • Create another subcommand that accepts a date module link: this can be done by separating the method that parses the date module into its own function and combining it with a new subcommand.
  • Caching the date module links: create a cache JSON file after getting the thread links, so the program doesn't have to parse the links again; or even cache the entire thread's main article, even if it doesn't match.

This is not the most elegant answer, but I think it is better than the bash answer.

  • It uses Python, which means it can be used cross-platform.
  • Simple installation: all required packages can be installed using pip.
  • It can be developed further; the more readable the program is, the easier it is to develop.
  • It does the same job as the above bash script in only 13 minutes.

Solution 3: Bash script that uses Lynx

I've improved my script based on this answer provided by @karel. Now the script uses lynx instead of wget. As a result it becomes significantly faster.

The current version does the same job in 15 minutes when there are two search keywords, and in only 8 minutes if we search for only one keyword. That is faster than the Python solution provided by @dan.

In addition, lynx provides better handling of non-Latin characters.
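For context, lynx -dump prints the rendered page text followed by a numbered list of all links on the page, one per line, with the URL as the second field – that is what the awk '/http/{print $2}' filter in get_url_map() below extracts. A minimal sketch over an illustrative sample line:

# Illustrative line from the link list that 'lynx -dump' appends to its output:
sample='   3. http://e-bane.net/modules.php?name=News&file=article&sid=5644'
echo "$sample" | awk '/http/{print $2}'
# Prints: http://e-bane.net/modules.php?name=News&file=article&sid=5644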

#!/bin/bash
# Name:  web-spider.lynx.sh

TARGET_URL='http://e-bane.net/modules.php?name=Stories_Archive'
KEY_WORDS=('pa4080')  # KEY_WORDS=('word' 'some short sentence')
MAP_FILE='url.map'
OUT_FILE='url.list'

get_url_map() {
    # Use 'lynx' as spider and output the result into a file 
    lynx -dump "${TARGET_URL}" | awk '/http/{print $2}' | uniq -u > "$MAP_FILE"
    while IFS= read -r target_url; do lynx -dump "${target_url}" | awk '/http/{print $2}' | uniq -u >> "${MAP_FILE}.full"; done < "$MAP_FILE"
    mv "${MAP_FILE}.full" "$MAP_FILE"
}

filter_url_map() {
    # Apply some filters to the $MAP_FILE and keep only the URLs, that contain 'article&sid'
    uniq "$MAP_FILE" | grep -v '\.\(css\|js\|png\|gif\|jpg\|txt\)$' | grep 'article&sid' | sort -u > "${MAP_FILE}.uniq"
    mv "${MAP_FILE}.uniq" "$MAP_FILE"
    printf '\n# -----\nThe number of the pages to be scanned: %s\n' "$(wc -l < "$MAP_FILE")"
}

get_key_urls() {
    counter=1
    # Do this for each line in the $MAP_FILE
    while IFS= read -r URL; do
        # For each $KEY_WORD in $KEY_WORDS
        for KEY_WORD in "${KEY_WORDS[@]}"; do
            # Check if the $KEY_WORD exists within the content of the page, if it is true echo the particular $URL into the $OUT_FILE
            if [[ ! -z "$(lynx -dump -nolist "${URL}" | grep -io "${KEY_WORD}" | head -n1)" ]]; then
                echo "${URL}" | tee -a "$OUT_FILE"
                printf '%s\t%s\n' "${KEY_WORD}" "YES"
            fi
        done
        printf 'Progress: %s\r' "$counter"; ((counter++))
    done < "$MAP_FILE"
}

# Call the functions
get_url_map
filter_url_map
get_key_urls
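One practical note: get_key_urls() appends the matches to url.list with tee -a, so a second run adds to the results of the previous one. A small sketch of a clean re-run, assuming the script is saved under the name from its header comment:

rm -f url.map url.list        # drop the work files from the previous run
./web-spider.lynx.sh          # rebuild the URL map and scan it again
sort -u url.list              # the de-duplicated list of matching article URLs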