Xidel is a command-line tool to download and extract data from HTML/XML pages as well as JSON APIs.





2018-04-02: New release 0.9.8
The 0.9.8 release improves cookie handling and module loading, adds pattern matching between sibling elements, functions to use multipage templates and the variable change log dynamically, as well as minor bug fixes and performance improvements.

2016-11-20: New release 0.9.6
The 0.9.6 release brings new functions, performance improvements, bug fixes, and stricter default settings.

2016-06-08: New release 0.9.4
The 0.9.4 release completes the XPath/XQuery 3.0 support, implements the EXPath file module, uses a new regular expression library and has various other improvements.

2016-02-25: New development snapshots
There is now a download folder with prereleases for upcoming versions. The survey below is still running until the 1.0 version.

2015-08-16: Survey for later releases
There is now a survey running on whether the default language of Xidel should be XPath or XQuery (on Google Forms, so you need to log in). Both are Turing-complete, but have slightly incompatible string syntax, so the question is which one you prefer.

2015-06-28: New release
The 0.9 release adds support for most of the XPath/XQuery 3.0 syntax like anonymous and higher-order functions, supports multipart HTTP requests for file uploads, changes the default output format, adds an (experimental) function for page modifications, fixes a large number of bugs mostly related to command-line parsing and XPath/XQuery standard compatibility, and more...

2014-08-13: Minor release
The 0.8.4 version extends some standard XQuery expressions with pattern matching, adds options to set HTTP headers and read environment variables, and fixes some bugs...

2014-03-24: New release
The 0.8 release improves the JSONiq support and our own JSON extensions, adds arbitrary-precision arithmetic, a trivial subset of XPath/XQuery 3, new functions for resolving URIs or HTML hrefs, and more...

2013-03-26: New release
The 0.7 release adds JSONiq support, grouping of command-line options, new input/output formats, fixes HTML parsing/serialization, and changes the syntax of extended strings, among other things.

2012-11-06: New release
The 0.6 release adds XQuery support and the form and match functions, improves the Windows command-line interface, merges the two old CGI services into a single one and fixes several interpreter bugs.

2012-09-05: Initial release of Xidel
First release of the VideLibri backend as a stand-alone command-line tool.

Extract expressions:
CSS 3 Selectors: to extract simple elements
XPath 3.0: to extract values and calculate things with them
XQuery 3.0: to create new documents from the extracted values
JSONiq: to work with JSON APIs
Templates: to extract several expressions in an easy way, using an annotated version of the page for pattern matching
XPath 2.0/XQuery 1.0: compatibility mode for the old XPath/XQuery versions

Following:
HTTP Codes: Redirections like 30x are automatically followed, while keeping things like cookies
Links: It can follow all links on a page as well as some extracted values
Forms: It can fill in arbitrary data and submit forms

Output formats:
Adhoc: just prints the data in a human-readable format
XML: encodes the data as XML
HTML: encodes the data as HTML
JSON: encodes the data as JSON
bash/cmd: exports the data as shell variables

Connections: HTTP / HTTPS as well as local files or stdin
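The stdin connection can be tried directly (a minimal sketch, assuming xidel is installed and on the PATH; `-` selects stdin as the input):

```shell
# Pipe an HTML snippet into xidel; "-" reads from stdin,
# -s suppresses status messages, -e extracts the link text.
echo '<html><a href="x">hello</a></html>' | xidel -s - -e '//a'
```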

Systems: Windows (using wininet), Linux (using synapse+openssl), Mac (synapse)

Print all URLs found by a Google search:



xidel http://www.google.de/search?q=test --extract "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']"

Print the title of all pages found by a Google search and download them:



xidel http://www.google.de/search?q=test --follow "//a/extract(@href, 'url[?]q=([^&]+)&', 1)[. != '']" --extract //title --download '{$host}/'

Generally follow all links on a page and print the titles of the linked pages:

With XPath: xidel http://example.org -f //a -e //title

With CSS: xidel http://example.org -f "css('a')" --css title

With pattern matching: xidel http://example.org -f "<a>{.}</a>*" -e "<title>{.}</title>"

Another template example:



If you have an example.xml file like <x><foo>ood</foo><bar>IMPORTANT!</bar></x>

You can read the important part like: xidel example.xml -e "<x><foo>ood</foo><bar>{.}</bar></x>"

(This will also check whether the element containing "ood" is there, and fail otherwise.)

Calculate something with XPath using arbitrary-precision arithmetic:



xidel -e "(1 + 2 + 3) * 1000000000000 + 4 + 5 + 6 + 7.000000008"

Print all the newest Stack Overflow questions with title and URL:



xidel http://stackoverflow.com -e "<A class='question-hyperlink'>{title:=text(), url:=@href}</A>*"

Print all Reddit comments of a user, with HTML and URL:



xidel "http://www.reddit.com/user/username/" --extract "<t:loop><div class='usertext-body'><div>{outer-xml(.)}</div></div><ul class='flat-list buttons'><a><t:s>link:=@href</t:s>permalink</a></ul></div></div></t:loop>" --follow "<a rel='nofollow next'>{.}</a>?"

Check if your Reddit mail icon is red (web scraping combining CSS, XPath, JSONiq and automatic form evaluation):



xidel http://reddit.com -f "form(css('form.login-form')[1], {'user': '$your_username', 'passwd': '$your_password'})" -e "css('#mail')/@title"

Using the Reddit API:



xidel -d "user=$your_username&passwd=$your_password&api_type=json" https://ssl.reddit.com/api/login --method GET 'http://www.reddit.com/api/me.json' -e '($json).data.has_mail'

Use XQuery to create an HTML table of odd and even numbers:



Windows cmd: xidel --xquery "<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then 'even' else 'odd'}</td></tr>}</table>" --output-format xml

Linux/Powershell: xidel --xquery '<table>{for $i in 1 to 1000 return <tr><td>{$i}</td><td>{if ($i mod 2 = 0) then "even" else "odd"}</td></tr>}</table>' --output-format xml

(Xidel itself supports '- and "-quotes on all platforms, but ' does not escape <> in Windows' cmd, and " does not escape $ in the Linux shells.)

Export variables to the shell:



Linux/bash: eval "$(xidel http://site -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"

This sets the bash variable $title to the title of the page and $links becomes an array of all links there.
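The same round trip can be tried without a live site by feeding HTML through stdin (a sketch, assuming bash and xidel on the PATH; the inline HTML is made up for illustration):

```shell
# Export extracted values as bash variables from inline HTML read via stdin.
eval "$(echo '<html><title>T</title><a href="a">1</a><a href="b">2</a></html>' \
  | xidel -s - -e 'title:=//title' -e 'links:=//a/@href' --output-format bash)"
echo "$title"                # the page title
printf '%s\n' "${links[@]}"  # one link per line
```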



Windows cmd: FOR /F "delims=" %%A IN ('xidel http://site -e "title:=//title" -e "links:=//a/@href" --output-format cmd') DO %%A

This sets the batch variable %title% to the title of the page and %links% becomes an array of all links there.

Reading JSON:



Read the 10th array element: xidel file.json -e '$json(10)'

Read all array elements: xidel file.json -e '$json()'

Read property "foo" and then "bar" with JSONiq notation: xidel file.json -e '$json("foo")("bar")'

Read property "foo" and then "bar" with dot notation: xidel file.json -e '($json).foo.bar'

Read property "foo" and then "bar" with XPath-like notation: xidel file.json -e '$json/foo/bar'

Mixed example: xidel file.json -e '$json("abc")()().xyz/(u,v)'

This would read all the numbers from e.g. {"abc": [[{"xyz": {"u": 1, "v": 2}}], [{"xyz": {"u": 3}}, {"xyz": {"u": 4}} ]]} .
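That mixed selector can be checked by piping the example JSON in (a sketch; it assumes xidel is installed and that --input-format json is available to force JSON parsing of stdin):

```shell
# Read nested values with the mixed selector from the example above.
echo '{"abc": [[{"xyz": {"u": 1, "v": 2}}], [{"xyz": {"u": 3}}, {"xyz": {"u": 4}}]]}' \
  | xidel -s - --input-format json -e '$json("abc")()().xyz/(u,v)'
```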

All selectors are sequence-transparent, i.e. you can use the same selector to read something from one value as to read it from several values. Arrays are converted to sequences with ().

Using XPath 3.1 syntax (requires Xidel 0.9.9):

Read the 10th array element: xidel file.json -e '$json?10'

Read all array elements: xidel file.json -e '$json?*'

Read property "foo" and then "bar" with 3.1 notation: xidel file.json -e '$json?foo?bar'

Convert table rows and columns to a CSV-like format:



xidel http://site -e '//tr / join(td, ",")'
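The join call can also be tried on a plain sequence, without any input document (a sketch, assuming xidel is on the PATH):

```shell
# join abbreviates the XPath function string-join:
# it concatenates sequence items with a separator.
xidel -s -e 'join(("a","b","c"), ",")'
```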



join((...)) can generally be used to output some values in a single line. The function name is an abbreviation for the XPath function string-join. In the example, tr / join calls join for every row.

Modify/transform an HTML file, e.g. to mark all links as bold:



Windows cmd: xidel --html your-file.html --xquery "transform(/, function($e) { $e / (if (name() = 'a') then <a style='{join((@style, 'font-weight: bold'), '; ')}'>{@* except @style, node()}</a> else .) })" > your-output-file.html

Linux/Powershell: xidel --html your-file.html --xquery 'transform(/, function($e) { $e / (if (name() = "a") then <a style="{join((@style, "font-weight: bold"), "; ")}">{@* except @style, node()}</a> else .) })' > your-output-file.html

This example combines three important syntaxes:

transform(/, function($e) { .. }): This applies an anonymous function to every element in the HTML document, whereby that element is stored in the variable $e and is replaced by the return value of the function.

<a>{@* except @style, node()}</a>: This creates a new a-element that has the same children, descendants and attributes as the current element, but removes the style-attribute.

style="{join((@style, "font-weight: bold"), "; ")}": This creates a new style-attribute by appending "font-weight: bold" to the old value of the attribute. A separating "; " is inserted if (and only if) that attribute already existed.

The last official release is Xidel 0.9.8, but a Xidel 0.9.9 development version is published irregularly as a preview of the next release. It is recommended to use the 0.9.9 version, since it contains bug fixes, performs better, and partially supports XPath/XQuery 3.1. In it, most of the JSONiq syntax has been replaced by the XPath 3.1 JSON syntax. It will be published officially once all of XPath/XQuery 3.1 is implemented.

http://www.videlibri.de/cgi-bin/xidelcgi?data=<html><title>foobar</title></html>&extract=//title&raw=true

Compiling from source:

./build.sh

or manually with FPC: fpc xidel.pas (with -Fu and -Fi pointing to the internettools units)

or with Lazarus: install components/pascal/internettools.lpk and components/pascal/internettools_utf8.lpk, then open programs/internet/xidel/xidel.lpi

(Please do not ask me how to scrape your website. Ask how to do something with Xidel instead. I know Xidel, I do not know your website. The point of the tool is to make it easy for anyone to parse any webpage. Scraping every webpage myself does not scale well.)