My aim is to create a HashMap with a String as the key and a HashSet of Strings as the value.
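A minimal sketch of that structure in plain Java (the map and key names here are illustrative, not taken from my actual code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class QValMapSketch {
    public static void main(String[] args) {
        Map<String, Set<String>> qValMap = new HashMap<>();

        // computeIfAbsent creates the set the first time a key is seen,
        // then returns the existing set on every later call
        qValMap.computeIfAbsent("Bush", k -> new HashSet<>()).add("Q23505");
        qValMap.computeIfAbsent("Bush", k -> new HashSet<>()).add("Q207");

        System.out.println(qValMap); // one key, two Q values (set order may vary)
    }
}
```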

OUTPUT

This is what the output looks like now:

Hudson+(surname)=[Q2720681], Hudson,+Quebec=[Q141445], Hudson+(given+name)=[Q5928530], Hudson,+Colorado=[Q2272323], Hudson,+Illinois=[Q2672022], Hudson,+Indiana=[Q2710584], Hudson,+Ontario=[Q5928505], Hudson,+Buenos+Aires+Province=[Q10298710], Hudson,+Florida=[Q768903]]

According to my idea, it should look like this:

[Hudson+(surname)=[Q2720681,Q141445,Q5928530,Q2272323,Q2672022]]

The purpose is to store a particular name in Wikidata and then all of the Q values associated with its disambiguation. So, for example:

This is the page for "Bush".

I want Bush to be the key, and then, for each of the different points of departure — all of the different ways that Bush can lead to a terminal Wikidata page — I want to store the corresponding "Q value", i.e. the unique alphanumeric identifier.

What I'm actually doing is scraping the different names from the Wikipedia disambiguation page and then looking up, in Wikidata, the unique alphanumeric identifier associated with each of those values.

For example, with Bush we have:

George H. W. Bush
George W. Bush
Jeb Bush
Bush family
Bush (surname)

Accordingly the Q values are:

George H. W. Bush (Q23505)

George W. Bush (Q207)

Jeb Bush (Q221997)

Bush family (Q2743830)

Bush (Q1484464)

My idea is that the data structure should be constructed in the following way:

Key: Bush
Entry Set: Q23505, Q207, Q221997, Q2743830, Q1484464

But the code I have now doesn't do that.

It creates a separate entry for each name and Q value, i.e.

Key: Jeb Bush
Entry Set: Q221997

Key: George W. Bush
Entry Set: Q207

and so on.
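The difference between the two shapes comes down to which string is used as the key on each put. A sketch under the assumption that every (name, Q value) pair scraped from the page is keyed on the disambiguation term itself rather than on the individual name (the pairs below are illustrative):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GroupUnderOneKey {
    public static void main(String[] args) {
        Map<String, Set<String>> qValMap = new HashMap<>();
        String disambigTerm = "Bush"; // the page we started from, not the linked name

        // hypothetical (name, Q value) pairs scraped from the disambiguation page
        String[][] scraped = {
            {"George H. W. Bush", "Q23505"},
            {"George W. Bush", "Q207"},
            {"Jeb Bush", "Q221997"}
        };

        for (String[] pair : scraped) {
            // key on disambigTerm, not pair[0], so all Q values share one entry
            qValMap.computeIfAbsent(disambigTerm, k -> new HashSet<>()).add(pair[1]);
        }

        System.out.println(qValMap.keySet()); // [Bush]
    }
}
```

Keying on `pair[0]` instead would reproduce the one-entry-per-name behavior described above.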

The full code in all its glory can be seen on my GitHub page, but I'll summarize it below as well.

This is what I'm using to add values to my data structure:

    // add a Q value to the set in the hash map under the appropriate entity
    public static HashSet<String> put_to_hash(String key, String value) {
        if (!q_valMap.containsKey(key)) {
            q_valMap.put(key, new HashSet<String>());
        }
        HashSet<String> set = q_valMap.get(key);
        set.add(value); // mutate the set in place; no second put() needed
        return set;
    }
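One Java subtlety relevant to `put_to_hash`: `Map.put` returns the *previous* value mapped to the key (or `null` if there was none), not the value just stored, so returning the result of `put` on a fresh key yields `null`. A short stdlib-only demonstration:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class PutReturnDemo {
    public static void main(String[] args) {
        Map<String, Set<String>> map = new HashMap<>();

        // First put: there was no previous mapping, so put(...) returns null
        Set<String> previous = map.put("Hudson", new HashSet<>());
        System.out.println(previous); // null

        // Second put: returns the (empty) set stored by the first put
        Set<String> old = map.put("Hudson", new HashSet<>());
        System.out.println(old.isEmpty()); // true
    }
}
```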

This is how I fetch the content:

    while ((line_by_line = wiki_data_pagecontent.readLine()) != null) {
        // if we can determine it's a disambig page we need to send it off to get all
        // the possible senses in which it can be used.
        Pattern disambig_pattern = Pattern.compile("<div class=\"wikibase-entitytermsview-heading-description \">Wikipedia disambiguation page</div>");
        Matcher disambig_indicator = disambig_pattern.matcher(line_by_line);
        if (disambig_indicator.matches()) {
            // off to get the different usages
            Wikipedia_Disambig_Fetcher.all_possibilities( variable_entity );
        } else {
            // get the Q value off the page by matching
            Pattern q_page_pattern = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item "
                    + "wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a "
                    + "href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
            Matcher match_Q_component = q_page_pattern.matcher(line_by_line);
            if ( match_Q_component.matches() ) {
                String Q = match_Q_component.group(1);
                // 'Q' should be added to a set, since each entity can hold multiple
                // Q values on the basis of disambig
                put_to_hash( variable_entity, Q );
            }
        }
    }
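One thing worth double-checking in that loop: `Matcher.matches()` only succeeds when the pattern spans the *entire* input, while `Matcher.find()` succeeds when the pattern occurs anywhere in it. If the HTML line contains anything around the `<div>`, `matches()` silently returns false. A minimal illustration (with a toy pattern, not the real Wikidata markup):

```java
import java.util.regex.Pattern;

public class MatchesVsFind {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("<div>disambiguation</div>");
        // extra content around the div, as there often is on a real HTML line
        String line = "<span></span><div>disambiguation</div>";

        System.out.println(p.matcher(line).matches()); // false: pattern must cover the whole line
        System.out.println(p.matcher(line).find());    // true: pattern found as a substring
    }
}
```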

and this is how I deal with a disambiguation page:

    public static void all_possibilities( String variable_entity ) throws Exception {
        System.out.println("this is a disambig page");
        // if it's a disambig page we know we can go right to Wikipedia
        // and get its normal wiki disambig page
        Document docx = Jsoup.connect( "https://en.wikipedia.org/wiki/" + variable_entity ).get();
        // this can handle the less structured ones.
        Elements linx = docx.select( "p:contains(" + variable_entity + ") ~ ul a:eq(0)" );
        for (Element linq : linx) {
            System.out.println(linq.text());
            String linq_nospace = linq.text().replace(' ', '+');
            Wikidata_Q_Reader.getQ( linq_nospace );
        }
    }
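A side note on `linq.text().replace(' ', '+')`: that assumes spaces are the only characters that need escaping in the URL, but article titles with characters like `,` or `&` would produce a bad request. The standard-library `URLEncoder` handles the general case (using form encoding, where a space also becomes `+`):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeTitle {
    public static void main(String[] args) {
        String title = "Hudson, Quebec";

        // replace(' ', '+') only handles spaces; the comma passes through unescaped
        System.out.println(title.replace(' ', '+'));                          // Hudson,+Quebec

        // URLEncoder escapes every reserved character (space -> '+', ',' -> %2C)
        System.out.println(URLEncoder.encode(title, StandardCharsets.UTF_8)); // Hudson%2C+Quebec
    }
}
```

Whether Wikidata's lookup endpoint accepts form-encoded titles is something to verify; the point is only that bare `replace` covers fewer cases.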