From 12 – HTTP C – Python for Everybody Course – YouTube
Learning about the socket library, sockets, UTF-8, HTTP, HTML, and ASCII / Unicode
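import socket
# Open a TCP connection to the web server on port 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
# Build the GET request; the blank line (\r\n\r\n) marks the end of the request
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)
# Read the response in chunks of up to 512 bytes until the server closes the connection
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
mysock.close()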
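The socket version prints everything the server sends back, which means the HTTP headers come out first, then a blank line, then the document itself. As a rough sketch of how the two parts could be pulled apart once the raw bytes have been collected: the function name split_http_response and the sample response below are made up for illustration and are not from the course.

```python
def split_http_response(raw: bytes):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # Headers and body are separated by the first blank line (\r\n\r\n)
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Each header line looks like "Name: value"
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

# Made-up sample response, standing in for the bytes read from the socket
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: text/plain\r\n'
          b'Content-Length: 5\r\n'
          b'\r\n'
          b'hello')

status_line, headers, body = split_http_response(sample)
print(status_line)
print(headers['content-type'])
print(body.decode())
```

To use this with the socket loop, the chunks from recv() would need to be joined into one bytes object first instead of being printed as they arrive.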
From 12 – HTTP E – Python for Everybody Course – YouTube
Now using the urllib library in Python
import urllib.request
file_handle = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in file_handle:
    print(line.decode().strip())
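Since the handle from urlopen can be looped over like a file whose lines are bytes, the same loop can feed other processing instead of just printing. Here is a small sketch that counts words; count_words is a hypothetical helper name, and the sample lines are made up so the example runs without a network connection.

```python
from collections import Counter

def count_words(lines):
    """Count words across an iterable of bytes lines."""
    counts = Counter()
    for line in lines:
        # Each line arrives as bytes, so decode it before splitting into words
        counts.update(line.decode().split())
    return counts

# Made-up lines standing in for what the urlopen handle would yield
sample_lines = [b'the quick brown fox\n', b'jumps over the lazy dog\n']
counts = count_words(sample_lines)
print(counts.most_common(1))
```

Passing the urlopen handle in place of sample_lines should work the same way, because the function only relies on getting one bytes line at a time.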
From 12 – HTTP F – YouTube
Using Beautiful Soup
Beautiful Soup Documentation — Beautiful Soup 4.12.0 documentation (crummy.com)
My link-scraping script
import requests
from bs4 import BeautifulSoup
# Ask for the website to scrape; the https:// prefix is added for you
url = 'https://' + input("Website please : ")
# Send a GET request to the URL
response = requests.get(url)
# Check whether the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract all the links on the page
    links = soup.find_all('a')
    # Print the href of each link
    for link in links:
        print(link.get('href'))
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
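One thing to watch with the script above: link.get('href') returns None for an anchor tag that has no href, and many of the values that do come back are relative paths rather than full URLs. A possible follow-up step, sketched here with only the standard library, is to skip the empty values and resolve the rest against the page URL. The helper name absolute_links and the example.com URLs are assumptions for illustration only.

```python
from urllib.parse import urljoin

def absolute_links(base_url, hrefs):
    """Turn a list of href values into absolute URLs, skipping empty ones."""
    result = []
    for href in hrefs:
        if not href:
            # Skip None (no href attribute) and empty strings
            continue
        # urljoin resolves relative paths against the page URL
        result.append(urljoin(base_url, href))
    return result

# Made-up href values of the kind link.get('href') can return
hrefs = ['/about', 'page2.html', None, 'https://example.org/x', '#top']
links = absolute_links('https://example.com/docs/index.html', hrefs)
for link in links:
    print(link)
```

In the scraping script, the list could be built with [link.get('href') for link in soup.find_all('a')] and passed in together with url.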
Installing libraries
pip install beautifulsoup4
pip install requests
pip install lxml