If you want to convert a Wikipedia article to Markdown, you can use the open-source package I wrote to do it in seconds.
I made it because I wanted to convert some Wikipedia articles to Markdown for my personal notes and some AI / ML projects. I couldn't find a simple script to do this, so I wrote one myself. I hope you find it useful.
This is a simple script to convert a Wikipedia article to Markdown and optionally download the images too.
git clone https://github.com/erictherobot/wikipedia-markdown-generator.git
cd wikipedia-markdown-generator
pip3 install -r requirements.txt
python3 wiki-to-md.py <topic_name>
Running wiki-to-md.py writes a Markdown file named after the topic into a newly created md_output directory. If you want to download the images too, use wiki-to-md-images.py instead; the images will be placed inside md_output/images/.
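For example, assuming the topic resolves to a single article rather than a disambiguation page:

python3 wiki-to-md.py "Alan Turing"

This would create md_output/Alan Turing.md; passing the same topic to wiki-to-md-images.py would also fill md_output/images/.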
Note: eventually, wiki-to-md.py and wiki-to-md-images.py will be combined into one script with a flag to download images or not.
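If you're curious what that combined interface could look like, here is a minimal sketch of the argument parsing; the --images flag and the combined generate_markdown signature are assumptions, not part of the package today:

import argparse

parser = argparse.ArgumentParser(
    description="Generate a Markdown file from a Wikipedia article"
)
parser.add_argument("topic", type=str, help="The Wikipedia topic to convert")
# Hypothetical flag for the eventual combined script
parser.add_argument("--images", action="store_true", help="Also download the article's images")
args = parser.parse_args()
# A combined script could then branch on args.images, e.g.
# generate_markdown(args.topic, download_images=args.images)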
There are two scripts, one that downloads images and one that doesn't. I'll show you both.
Here's the wiki-to-md.py file:
import os
import wikipedia
import argparse
import re
def generate_markdown(topic):
    try:
        page = wikipedia.page(topic)
    except wikipedia.exceptions.DisambiguationError as e:
        print(e.options)
        return None
    except wikipedia.exceptions.PageError:
        print(f"Page not found for the topic: {topic}")
        return None

    # Build the document: convert Wikipedia's == / === headings to ## / ###
    markdown_text = f"# {topic}\n\n"
    page_content = re.sub(r"=== ([^=]+) ===", r"### \1", page.content)
    page_content = re.sub(r"== ([^=]+) ==", r"## \1", page_content)
    sections = re.split(r"\n(## .*)\n", page_content)
    for i in range(0, len(sections), 2):
        # Keep a section only if its body contains non-empty lines
        if i + 1 < len(sections) and any(
            line.strip() for line in sections[i + 1].split("\n")
        ):
            markdown_text += f"{sections[i]}\n{sections[i+1]}\n\n"

    directory = "md_output"
    os.makedirs(directory, exist_ok=True)
    filename = os.path.join(directory, f"{topic}.md")
    with open(filename, "w") as md_file:
        md_file.write(markdown_text)

    print(f"Markdown file created: {filename}")
    return filename


parser = argparse.ArgumentParser(
    description="Generate a Markdown file from a Wikipedia article"
)
parser.add_argument(
    "topic",
    type=str,
    help="The Wikipedia topic to generate a Markdown file for",
)
args = parser.parse_args()
topic = args.topic
generate_markdown(topic)
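The two re.sub calls do the heavy lifting: Wikipedia's plain-text export marks sections as == Heading == and subsections as === Heading ===, and the substitutions turn them into ## and ### Markdown headings. The === pattern has to run before the == pattern, otherwise the shorter pattern would match inside the longer markers and mangle them. A quick illustration on made-up content (not from a real article):

import re

sample = "== History ==\nSome text.\n=== Early years ===\nMore text."
sample = re.sub(r"=== ([^=]+) ===", r"### \1", sample)
sample = re.sub(r"== ([^=]+) ==", r"## \1", sample)
print(sample)
# ## History
# Some text.
# ### Early years
# More text.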
Here's the wiki-to-md-images.py file (in case you want to download the images too):
import os
import wikipedia
import argparse
import re
import requests
import urllib.parse
def generate_markdown(topic):
    try:
        page = wikipedia.page(topic)
    except wikipedia.exceptions.DisambiguationError as e:
        print(e.options)
        return None
    except wikipedia.exceptions.PageError:
        print(f"Page not found for the topic: {topic}")
        return None

    markdown_text = f"# {topic}\n\n"
    page_content = re.sub(r"=== ([^=]+) ===", r"### \1", page.content)
    page_content = re.sub(r"== ([^=]+) ==", r"## \1", page_content)
    sections = re.split(r"\n(## .*)\n", page_content)
    for i in range(0, len(sections), 2):
        if i + 1 < len(sections) and any(
            line.strip() for line in sections[i + 1].split("\n")
        ):
            markdown_text += f"{sections[i]}\n{sections[i+1]}\n\n"

    # Create a directory for markdown files
    output_directory = "md_output"
    os.makedirs(output_directory, exist_ok=True)

    # Create a directory for image files
    image_directory = os.path.join(output_directory, "images")
    os.makedirs(image_directory, exist_ok=True)

    # Download every image from the article and link to it from the Markdown
    for image_url in page.images:
        image_filename = urllib.parse.unquote(os.path.basename(image_url))
        image_path = os.path.join(image_directory, image_filename)
        image_data = requests.get(image_url).content
        with open(image_path, "wb") as image_file:
            image_file.write(image_data)
        markdown_text += f"![{image_filename}](images/{image_filename})\n\n"

    filename = os.path.join(output_directory, f"{topic}.md")
    with open(filename, "w") as md_file:
        md_file.write(markdown_text)

    print(f"Markdown file created: {filename}")
    return filename


parser = argparse.ArgumentParser(
    description="Generate a Markdown file and download images from a Wikipedia article"
)
parser.add_argument(
    "topic",
    type=str,
    help="The Wikipedia topic to generate a Markdown file for",
)
args = parser.parse_args()
topic = args.topic
generate_markdown(topic)
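One detail worth noting in the image version: the local filename is derived directly from the image URL, with os.path.basename keeping the last path segment and urllib.parse.unquote turning percent-escapes back into readable characters, and the Markdown then references the file through the relative images/ directory. A small demonstration with a made-up URL (the path is hypothetical; only the filename handling matters):

import os
import urllib.parse

image_url = "https://upload.wikimedia.org/wikipedia/commons/a/ab/Turing%2C_Alan.jpg"
image_filename = urllib.parse.unquote(os.path.basename(image_url))
print(image_filename)  # Turing, Alan.jpg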
This project is licensed under the MIT License - see the LICENSE file for details.
If you find this useful as-is, please let me know. If you find any bugs, feel free to submit a pull request or open an issue. If you have any questions, you can contact me.
Please consider Buying Me A Coffee. I work hard to bring you my best content and any support would be greatly appreciated. Thank you for your support!