Wikipedia

I downloaded the Wikipedia dump from here and built a local database from it. You can refer to Robert’s instructions to import the entire Wikipedia into your local database.

The next step is to create a list of keywords from the title of each Wikipedia article, because Wikipedia seems to have an entry for nearly every term. It also captures different forms of the same term, such as “World War I”, “WW1” or “World War One”.
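As a rough sketch of this step, the snippet below turns raw titles into a keyword list. It assumes the titles are already available as plain strings, one per article, with underscores in place of spaces (the way Wikipedia stores titles) and possibly trailing line endings. The function name titles_to_keywords and the sample data are made up for illustration.

```python
def titles_to_keywords(titles):
    """Turn raw article titles into a de-duplicated keyword list."""
    seen = set()
    keywords = []
    for title in titles:
        # Drop line endings and surrounding whitespace, and turn the
        # underscores used in stored titles back into spaces.
        keyword = title.strip().replace("_", " ")
        if keyword and keyword not in seen:
            seen.add(keyword)
            keywords.append(keyword)
    return keywords


# Hypothetical sample input, including a duplicate and an empty line.
sample_titles = ["World_War_I\r\n", "WW1\r\n", "World_War_One\r\n", "WW1\r\n", "\r\n"]
keywords = titles_to_keywords(sample_titles)
print(keywords)  # ['World War I', 'WW1', 'World War One']
```

Cleaning the titles once at this stage means the extraction loop does not have to strip line endings from every keyword on each run.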

Now we have the list of keywords here (Chinese version), and we can use a simple script to extract the keywords that appear in an article.

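    import re

    # keywords: list of Wikipedia titles; article: the text to scan
    keyword_article = []
    for k in keywords:
        # strip the line ending left over from the keyword file
        k = re.sub("\r\n", "", k)
        if k in article:
            keyword_article.append(k)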

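However, this is not the end of the step, because the list of extracted keywords may contain overlapping keywords: when the article contains “World War I”, the shorter keywords “World”, “War” and “World War” are matched as well. These overlapping keywords are filtered out by dropping every keyword that is a substring of another extracted keyword.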

    keyword_overlap = []
    for g in keyword_article:
        for h in keyword_article:
            if g != h:
                if h in g:
                    keyword_overlap.append(h)

    wiki_terms = list(set(keyword_article) - set(keyword_overlap))
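To see both steps working together, here is a minimal self-contained sketch that wraps the extraction and the overlap filter in one function and runs it on a tiny example. The function name extract_wiki_terms, the sample article and the sample keyword list are invented for illustration. The logic is the same plain substring matching as above.

```python
def extract_wiki_terms(article, keywords):
    """Return the keywords found in the article, minus overlapped ones."""
    # Step 1: keep every non-empty keyword that occurs in the article text.
    found = {k for k in keywords if k and k in article}
    # Step 2: a keyword is "overlapped" if it is a substring of another
    # keyword that was also found.
    overlapped = {h for g in found for h in found if g != h and h in g}
    return sorted(found - overlapped)


# Hypothetical sample data.
article = "World War I began in 1914."
keywords = ["World", "War", "World War", "World War I", "Peace"]
terms = extract_wiki_terms(article, keywords)
print(terms)  # ['World War I']
```

Two limitations of this approach are worth keeping in mind. Substring matching ignores word boundaries, so “War” would also match inside “Warsaw”. The filter also removes a short keyword whenever a longer one containing it is found, even if the short one appears on its own elsewhere in the article.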

The next step is to determine the importance of the keywords that the program extracts from the article.