Project Title: Scrape Hotel Listings, Reviews, Location, Property Information, User Data

Project Description:

1. Context

– We are looking for a partner to whom we can outsource our website crawling, scraping, and data extraction activities.

– We are looking for an SLA of 99%. Quality and accuracy are of the utmost importance.

– The number of websites to crawl will increase over time and exceed 100.

– Industry: Hotels.

– Type of data: (1) hotel listings, (2) hotel property information, (3) hotel location data, (4) travel reviews, (5) user data.

– Languages: All.

– Period: historical data (back to a specified cut-off date) as well as incremental data on a monthly basis.

– Coverage:

– Short-term: list of cities to be provided by Travelsify.

– Long-term: Worldwide.

– Additional challenge compared to standard scraping: all scraped data must be normalized per hotel property, which means that matching hotel properties across websites is mandatory:

– Example: Hotel Property #000001 is composed of the following pages:

– https://www.expedia.com/Barcelona-Hotels-W-Barcelona.h2578680.Hotel-Information

– http://www.tripexpert.com/barcelona/hotels/w-barcelona

– https://www.tripadvisor.com/Hotel_Review-g187497-d1465497-Reviews-W_Barcelona-Barcelona_Catalonia.html

– http://www.booking.com/hotel/es/w-barcelona.en-gb.html

– http://www.telegraph.co.uk/travel/destinations/europe/spain/catalonia/barcelona/hotels/w-barcelona-hotel/

– Pilot: Before entering into any commercial collaboration, we would like to run a test on the following cities to validate accuracy, performance, and ways of working:

a. Paris (city and surroundings), France

b. Barcelona (downtown city), Spain

c. Boston (downtown city), MA-USA
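The per-property normalization described in this section could be represented along the following lines. This is only a minimal sketch: the `HotelProperty` class, its field names, and the choice of keying each page by hostname are illustrative assumptions, not part of the brief. The URLs are the W Barcelona example given above.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class HotelProperty:
    """One normalized hotel property and its page on each source website.

    Hypothetical structure for illustration; the brief does not prescribe one.
    """
    property_id: str
    name: str
    # Maps a source hostname (e.g. "booking.com") to that site's hotel page URL.
    source_urls: dict[str, str] = field(default_factory=dict)

    def add_url(self, url: str) -> None:
        host = urlparse(url).hostname or ""
        # Drop a leading "www." so that both URL forms share one key.
        if host.startswith("www."):
            host = host[4:]
        self.source_urls[host] = url


w_barcelona = HotelProperty(property_id="000001", name="W Barcelona")
for url in [
    "https://www.expedia.com/Barcelona-Hotels-W-Barcelona.h2578680.Hotel-Information",
    "http://www.tripexpert.com/barcelona/hotels/w-barcelona",
    "https://www.tripadvisor.com/Hotel_Review-g187497-d1465497-Reviews-W_Barcelona-Barcelona_Catalonia.html",
    "http://www.booking.com/hotel/es/w-barcelona.en-gb.html",
    "http://www.telegraph.co.uk/travel/destinations/europe/spain/catalonia/barcelona/hotels/w-barcelona-hotel/",
]:
    w_barcelona.add_url(url)
```

Keying by hostname keeps one page per website for each property, which is one simple way to make duplicates or missing sources easy to spot during the pilot.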



2. Websites

– For the purpose of the pilot, we are looking to scrape the following 5 websites:

a. Booking: https://www.booking.com

b. TripAdvisor: https://www.tripadvisor.com

c. Expedia: https://www.expedia.com

d. TripExpert: http://www.tripexpert.com

e. Telegraph Travel: http://www.telegraph.co.uk/travel/

– Should the pilot on the 3 cities mentioned above be conclusive, the list of websites to crawl/scrape will be extended and will exceed 100.



3. Hotel Property Listings

– For each of the websites to scrape, the list of hotel properties per city needs to be retrieved.

– The hotel property listings should not include apartments, bed and breakfasts, boats, guesthouses, hostels, holiday homes, or love hotels.

– Hotels of all star ratings, from 0 to 5 stars, are in scope.

– Additional work: matching of hotel properties across websites (cf. section 1)
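One possible starting point for the cross-website matching is to compare normalized hotel names and, where coordinates are available, the distance between the two listings. The sketch below is an assumption-laden illustration, not a prescribed method: the function names, the decision to ignore the word "hotel", the 150 m threshold, and the sample coordinates are all hypothetical. Sources without coordinates would need a different signal, such as the postal address.

```python
import math
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and drop the generic word 'hotel'."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t != "hotel")


def distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres, using the haversine formula."""
    earth_radius_m = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def same_property(a: dict, b: dict, max_distance_m: float = 150.0) -> bool:
    """Treat two listings as the same hotel if names agree and they are close together."""
    if normalize_name(a["name"]) != normalize_name(b["name"]):
        return False
    return distance_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_distance_m


# Illustrative listings; the coordinates are made-up sample values.
listing_a = {"name": "W Barcelona", "lat": 41.3684, "lon": 2.1900}
listing_b = {"name": "W Barcelona Hotel", "lat": 41.3686, "lon": 2.1902}
listing_c = {"name": "W Barcelona", "lat": 48.8566, "lon": 2.3522}
```

Exact name equality is deliberately strict here; a real matcher would likely need fuzzy comparison and manual review of borderline cases to meet the accuracy expectations stated in section 1.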



4. Hotel Location Data

– For each of the websites to scrape, the following location data shall be retrieved:

a. Hotel name

b. Hotel page url from the website, e.g. http://www.booking.com/hotel/es/rivoliramblas.engb.html

c. Hotel street

d. Hotel street number

e. Hotel city

f. Hotel postal code

g. Hotel country

h. Hotel longitude (only for Expedia, TripAdvisor, Booking)

i. Hotel latitude (only for Expedia, TripAdvisor, Booking)
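The location fields listed above could be captured in a record such as the following. This is a sketch under stated assumptions: the `HotelLocation` class, the lowercase source labels, and the `problems` check are hypothetical, and the address values in the example are placeholders rather than real data. The only rule taken from the list is that longitude and latitude are expected for Expedia, TripAdvisor, and Booking only.

```python
from dataclasses import dataclass
from typing import Optional

# Per the list above, coordinates are only required for these three sources.
SOURCES_WITH_COORDINATES = {"expedia", "tripadvisor", "booking"}


@dataclass
class HotelLocation:
    source: str          # hypothetical short label for the website
    name: str
    url: str
    street: str
    street_number: str
    city: str
    postal_code: str
    country: str
    longitude: Optional[float] = None
    latitude: Optional[float] = None

    def problems(self) -> list[str]:
        """Return a list of data-quality issues found in this record."""
        issues = []
        if self.source in SOURCES_WITH_COORDINATES:
            if self.longitude is None or self.latitude is None:
                issues.append("missing coordinates")
        if self.longitude is not None and not -180 <= self.longitude <= 180:
            issues.append("longitude out of range")
        if self.latitude is not None and not -90 <= self.latitude <= 90:
            issues.append("latitude out of range")
        return issues


# Placeholder address values, for illustration only.
booking_record = HotelLocation(
    source="booking",
    name="W Barcelona",
    url="http://www.booking.com/hotel/es/w-barcelona.en-gb.html",
    street="Example Street",
    street_number="1",
    city="Barcelona",
    postal_code="00000",
    country="Spain",
)
```

Here `booking_record.problems()` reports the missing coordinates, whereas the same record with a source such as `"tripexpert"` would pass, since coordinates are not requested for that site.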



5. Hotel Property Information

– For each of the websites to scrape, the following hotel property data shall be retrieved if present:

a. Hotel name

b. Hotel other names, e.g. Expedia indicates whether the hotel has other names and which ones: https://www.expedia.com/Barcelona-Hotels-W-Barcelona.h2578680.Hotel-Information (the other names for W Barcelona are: Barcelona W, W Hotel, W Barcelona Catalonia, W Barcelona, W Hotel Barcelona, w Barcelona Hotel Barcelona)

c. Hotel ID

Strictly confidential. April 23, 2016.

d. Hotel full description in all languages from the website

e. Hotel description as provided by the hotel (“official description” from TripAdvisor, “an inside look at…” from Booking.com)

f. Number of rooms

g. Hotel chain name

h. List of all hotel amenities/facilities (only in English)

i. List of all hotel policies (only in English) including hotel price range on TripAdvisor



6. Travel reviews

– For each of the websites to scrape, the following travel review data shall be retrieved if present. Please note that travel reviews can come from users (TripAdvisor, Booking, Expedia) but also from travel experts/journalists (TripExpert, Telegraph Travel).

a. Overall review score

b. Review score per category (cleanliness, location, staff, comfort, free wi-fi, value for money, etc.)

c. Full text of all reviews, in all languages, posted since November 2014. Please note that only the original-language text is accepted; translations of reviews are not.

d. Date of the travel review

e. Language of the travel review
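The review requirements above (scores, date, language, the November 2014 cut-off, and the exclusion of translations) could be checked with a small record and filter like the one below. It is a hedged sketch: the `TravelReview` class, its field names, the `is_translation` flag, and the reading of "since November 2014" as "on or after 1 November 2014" are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed interpretation of "since November 2014".
REVIEWS_SINCE = date(2014, 11, 1)


@dataclass
class TravelReview:
    source: str
    text: str
    review_date: date
    language: str                      # e.g. "en" or "fr"; the code scheme is an assumption
    is_translation: bool = False       # True if the site served a translated copy
    overall_score: Optional[float] = None
    category_scores: Optional[dict[str, float]] = None


def in_scope(review: TravelReview) -> bool:
    """Keep original-language reviews dated on or after the cut-off."""
    return review.review_date >= REVIEWS_SINCE and not review.is_translation


# Made-up sample reviews, for illustration only.
reviews = [
    TravelReview("booking", "Great stay", date(2015, 3, 2), "en", overall_score=8.5),
    TravelReview("booking", "Too old", date(2014, 10, 31), "en"),
    TravelReview("tripadvisor", "Translated text", date(2016, 1, 10), "en", is_translation=True),
]
kept = [r for r in reviews if in_scope(r)]
```

Scores are left optional because the brief asks for them only where the website provides them.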



7. User Data

– For each travel review, the following user data shall be retrieved if present:

a. User name

b. User gender (male/female)

c. User age/age range

d. User city

e. User country



For similar work requirements, feel free to email us at info@webscrapingexpert.com