WikiD Instructables

HTML Read

Version 1.310709

This page covers the AutoTools HTML Read function. First off, let's explain what this feature does. In a nutshell, it is a web scraping tool. In layman's terms, it retrieves just the content you want from a webpage, say an image or the song currently playing on your favorite radio show.

Now let's go through the menus and explain what each option does, so you become a little more familiar with what the feature has to offer.

Easy Setup

Difficulty
Easy
Literally as it sounds. You give it the URL you see in your browser and something to look for (text or images), and it guides you through choosing the portions you want, filling in the CSS Queries for you. This mode suffices for most tasks, since we are usually after things like titles, headers, lists, URLs and images. It does not require you to know any HTML at all.

Manual Setup

Difficulty
Hard
This mode lets you write your own queries. This is useful when you are after data that Easy Setup cannot find, or when you are learning JSOUP, CSS, JS and HTML. It does require that you understand web page source code (HTML) and how JSOUP queries are written. The instructables here try to offer simple guidance on this, so you understand a little more about what the queries are actually doing.

URL

Either the URL of the webpage containing the information, a file on your local storage, or a variable containing HTML. If it is the URL of a webpage, it needs to be the exact URL as you see it in a web browser. Some sites require authentication. You can handle this either with the Authenticate feature (explained below) or, if the site supports BASIC Auth, by putting the credentials in the URL, for example https://user:password@website/url/page.php. Any page that delivers HTML as its response can be used (whether the page ends with .html .php .js .cgi etc).
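To make the BASIC Auth URL form a little less mysterious, here is a small Python sketch of what a client typically does with a URL such as https://user:password@website/url/page.php: it takes the credentials out of the URL and sends them as an Authorization header instead. This only illustrates the general mechanism, not AutoTools' own code, and the function name basic_auth_header and the host example.com are made up for the example.

```python
import base64
from urllib.parse import urlsplit


def basic_auth_header(url):
    """Return (url_without_credentials, authorization_header_value).

    The header value is None when the URL carries no credentials.
    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.username is None:
        return url, None

    # BASIC Auth sends "user:password" encoded as Base64.
    credentials = f"{parts.username}:{parts.password or ''}"
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")

    # Rebuild the host part without the "user:password@" prefix.
    host = parts.hostname
    if parts.port:
        host = f"{host}:{parts.port}"
    clean_url = parts._replace(netloc=host).geturl()

    return clean_url, f"Basic {token}"


if __name__ == "__main__":
    clean, header = basic_auth_header("https://user:password@example.com/url/page.php")
    print(clean)
    print("Authorization:", header)
```

Keep in mind that Base64 is an encoding, not encryption, so this form of login is only sensible over https.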

CSS Queries

This is where things get complex, both to learn and to explain. Queries use the selector syntax of the JSOUP library, which lets us easily search for specific tags, elements, ids and classes within a web page. When you use Easy Setup, you will notice that this box gets filled in for you. Understanding HTML is really important here. You do not have to be an expert web developer, but being able to read some HTML will greatly help.

The querying is done using the JSOUP Selector: https://jsoup.org/apidocs/org/jsoup/select/Selector.html

* More information on CSS Queries
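To get a feel for what a query like div.image-box img is asking for, here is a rough Python sketch using only the standard library. It is not JSOUP and it is not how AutoTools works internally. It simply walks the HTML and collects the src of every img found anywhere inside a div with the class image-box, which is what that selector describes. The class name ImageBoxSrcParser and the sample HTML are made up for illustration, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ImageBoxSrcParser(HTMLParser):
    """Collect the src of each <img> nested inside a <div class="image-box">."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one flag per open element: is it a div.image-box?
        self.sources = []  # collected src values, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # "div.image-box img" matches an img at any depth below such a div.
        if tag == "img" and any(self.stack):
            src = attrs.get("src")
            if src is not None:
                self.sources.append(src)
        if tag not in VOID_TAGS:
            classes = (attrs.get("class") or "").split()
            self.stack.append(tag == "div" and "image-box" in classes)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


if __name__ == "__main__":
    sample = (
        '<div class="image-box featured">'
        '<img src="a.png"><p><img src="b.png"></p>'
        '</div>'
        '<img src="c.png">'
    )
    parser = ImageBoxSrcParser()
    parser.feed(sample)
    print(parser.sources)
```

The image c.png is left out because it sits outside the div, which is exactly the kind of filtering a CSS Query gives you.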

Variable names

Our CSS Queries can end up very long and daunting to look at, and the variables AutoTools generates from them may be equally long. In this field we can give the Queries whatever names we want. Say we have one large search Query, div.image-box img()=:=src. We can set the name images here and get a simple Array called %images() to use instead. If we are doing multiple searches, comma separated, simply give the same number of comma separated names.
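The pairing of queries and names is by position: the first name goes with the first query, the second with the second, and so on. A minimal Python sketch of that idea follows. It assumes the two fields are split on commas as described above; the helper name pair_names and the second query h1.title are hypothetical and only there for illustration.

```python
def pair_names(queries, names):
    """Pair comma separated queries with comma separated variable names by position."""
    query_list = [q.strip() for q in queries.split(",")]
    name_list = [n.strip() for n in names.split(",")]
    if len(query_list) != len(name_list):
        raise ValueError("give exactly one name per query")
    return dict(zip(name_list, query_list))


if __name__ == "__main__":
    pairs = pair_names("div.image-box img()=:=src, h1.title", "images, title")
    for name, query in pairs.items():
        print(f"%{name}() <- {query}")
```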

Joiner

If you don't want an Array and would rather have your data in a single variable, set this field and the entries of the array will be combined into one variable instead. Each entry is separated from the next by whatever character you enter here. Most commonly, the comma is used.
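In programming terms the Joiner does what a "join" does in most languages. A tiny Python sketch, with made-up file names standing in for the entries of an array such as %images():

```python
def join_entries(entries, joiner=","):
    """Combine the entries of an array into one string, separated by the joiner."""
    return joiner.join(entries)


if __name__ == "__main__":
    images = ["a.png", "b.png", "c.png"]  # stand-in for the array entries
    combined = join_entries(images, ",")
    print(combined)
    # Splitting on the same character gets the individual entries back,
    # as long as no entry contains the joiner itself.
    print(combined.split(","))
```

This is also why it pays to pick a joiner character that does not appear in the data you are extracting.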

Output HTML

Instead of pulling just the information you want, this returns the entire HTML code. This can be handy for retrieving the source code of a page so you can paste it into, or share it with, a text editor for viewing.

Advanced

Use Javascript

Some websites depend on JavaScript to deliver their content. Sometimes it will be impossible to get the data you want without setting this option. Fortunately you won't need to know JavaScript to use this feature. It simply allows AutoTools to render the page properly so you can extract the data you want.

JavaScript Delay

The amount of time (in milliseconds) to wait after loading the page, so that JavaScript has a chance to render the content. Useful on sites that either take a while to load or render data after the rest of the page has loaded. A good value for most sites is 2000 (2 seconds).

Request Desktop Website

Depending on which version of a webpage you wish to view, you may want to select Request Desktop Site in order to load the “full” version of the page. However, sometimes only the mobile version has the information. This setting allows you to choose which version of the page you want to extract from.

Authenticate

Used to log in to websites, so that extraction can work on sites that require a login. It may be useful to tick “Remember me” on the login page of the service when authenticating via AutoTools, otherwise you may need to authenticate again before you can extract at a later time. AutoTools, like your browser, uses cookies to stay logged in for later requests.
