Scraping from Telegram

Samueleex

Today we will see how to scrape a Telegram channel and how best to configure the scraper. Let's start writing the code.
First we import the required modules:

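datetime: to define the date range of the messages to collect.
time: to pause between one write and the next.
TelegramClient from telethon.sync: to interact with the Telegram API.
gspread: to interact with Google Sheets.
default from google.auth: to obtain the default authentication credentials.
auth from google.colab: to authenticate the Google user inside Colab.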

The auth.authenticate_user() function authenticates the Google account in Colab, and creds, _ = default() then retrieves Google's default credentials.
Next come the values needed to access the Telegram channel and to configure the run: telegram_username, telegram_phone, telegram_api_id, telegram_api_hash, channel_name, worksheet_name.
The code then tries to open the Google Sheets spreadsheet; if it does not exist, a new one is created. This spreadsheet is used to store the results.
The column titles are written to the first row of the worksheet; they name the pieces of information that will be collected for each message.
The scrape_and_update function uses TelegramClient to connect to the Telegram channel set in the code, iterates over the channel's messages, and writes each one to the worksheet. It also handles attached media, reactions, comments, and other message details.

from datetime import datetime, timezone
import time

from telethon.sync import TelegramClient
import gspread
from google.auth import default
from google.colab import auth

# Google authentication
auth.authenticate_user()
creds, _ = default()
google_auth = gspread.authorize(creds)

# Telegram
telegram_username = 'username'
telegram_phone = '+39...'
telegram_api_id = '...'
telegram_api_hash = '...'

# scraping
channel_name = '@ehf'
worksheet_name = 'Telegram_Scraper_Output'
start_date = datetime(2019, 1, 1, tzinfo=timezone.utc)
end_date = datetime(2023, 1, 1, tzinfo=timezone.utc)
search_keyword = ''

# Google Sheets
try:
    spreadsheet = google_auth.open(worksheet_name).sheet1
except gspread.exceptions.SpreadsheetNotFound:
    spreadsheet = google_auth.create(worksheet_name).sheet1

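# clear any previous content from the worksheet
spreadsheet.clear()

# column titles (keyword arguments, since the positional order differs between gspread versions)
titles = ['Scraping ID', 'Group', 'Author ID', 'Content', 'Date', 'Message ID', 'Author', 'Views', 'Reactions', 'Shares', 'Media', 'Comments']
spreadsheet.update(range_name='A1:L1', values=[titles])

# scrape the channel and write each message to the worksheet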
async def scrape_and_update():
    async with TelegramClient(telegram_username, telegram_api_id, telegram_api_hash) as client:
        index = 1
        async for message in client.iter_messages(channel_name, search=search_keyword):
            if start_date < message.date < end_date:
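                # link to the post if it carries media (the '@' is not part of the URL)
                media_url = f'https://t.me/{channel_name}/{message.id}'.replace('@', '') if message.media else 'no media'

                # reactions as "emoji count" pairs; custom emoji reactions have no emoticon attribute
                reactions = ""
                if message.reactions:
                    for reaction in message.reactions.results:
                        emoji = getattr(reaction.reaction, 'emoticon', 'custom')
                        reactions += f"{emoji} {reaction.count} "

                # comments are the replies to the post; the lookup can fail (for example when comments are disabled)
                comments = ""
                try:
                    async for comment_message in client.iter_messages(channel_name, reply_to=message.id):
                        comments += f"{comment_message.text};\n"
                except Exception:
                    comments = 'possible adjustment'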

                content = [
                    f'#ID{index:05}',
                    channel_name,
                    message.sender_id,
                    message.text,
                    message.date.strftime('%Y-%m-%d %H:%M:%S'),
                    message.id,
                    message.post_author,
                    message.views,
                    reactions,
                    message.forwards,
                    media_url,
                    comments
                ]

                spreadsheet.update(range_name=f"A{index+1}:L{index+1}", values=[content])

                print(f'Item {index:05} completed!')
                print(f'Id: {message.id:05}.\n')

                index += 1
                time.sleep(1)

# run the scraper (top-level await works in Colab/Jupyter notebooks)
await scrape_and_update()

# end
print("end")
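
The loop above sends one write to Google Sheets per message and pauses a second between rows. A possible variation, sketched here under assumptions, is to collect the rows in a list first and then write them in chunks, so each chunk goes out with a single update call. The helper names column_letter, a1_range and chunked_ranges are hypothetical and not part of gspread; only the commented line at the bottom would touch the sheet.

```python
from typing import Iterator, List, Sequence, Tuple


def column_letter(n: int) -> str:
    """Convert a 1-based column number to A1 letters (1 -> A, 12 -> L, 27 -> AA)."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def a1_range(first_row: int, n_rows: int, n_cols: int) -> str:
    """Build an A1 range that starts in column A and covers n_rows x n_cols cells."""
    last_row = first_row + n_rows - 1
    return f"A{first_row}:{column_letter(n_cols)}{last_row}"


def chunked_ranges(
    rows: Sequence[Sequence], first_row: int = 2, chunk_size: int = 100
) -> Iterator[Tuple[str, List[Sequence]]]:
    """Yield (range, rows) pairs, each meant to be written with one update call."""
    for start in range(0, len(rows), chunk_size):
        chunk = list(rows[start:start + chunk_size])
        yield a1_range(first_row + start, len(chunk), len(chunk[0])), chunk


# Assumed usage, with `rows` collected inside scrape_and_update instead of
# writing each row immediately:
# for cell_range, chunk in chunked_ranges(rows):
#     spreadsheet.update(range_name=cell_range, values=chunk)
```

With the twelve columns used here, the first chunk of 100 rows would land in A2:L101, right below the title row.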

Before running the code we need to install the required packages, from a terminal (in a Colab cell, prefix each command with !):
pip install telethon gspread
pip install google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client
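
To check that the installation went through before launching the scraper, something like the following could be run first. It is only a sketch: missing_modules is a hypothetical helper name, and the list of modules simply mirrors the imports used in the script.

```python
import importlib.util


def is_available(name: str) -> bool:
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names when the parent package itself is missing.
        return False


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if not is_available(name)]


# Modules imported by the scraper; google.colab exists only inside Colab.
required = ["telethon", "gspread", "google.auth"]
missing = missing_modules(required)
if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All required modules found.")
```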

To find your own api_id and api_hash values, go to the Apps section of Telegram (my.telegram.org).
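
Since api_id and api_hash are personal, one option is to keep them out of the notebook and read them from environment variables. This is a sketch under assumptions: the variable names TELEGRAM_API_ID and TELEGRAM_API_HASH and the function load_telegram_credentials are hypothetical, chosen only for illustration.

```python
import os
from typing import Mapping, Tuple


def load_telegram_credentials(env: Mapping[str, str] = os.environ) -> Tuple[int, str]:
    """Read api_id (as an integer) and api_hash from an environment-style mapping.

    TELEGRAM_API_ID and TELEGRAM_API_HASH are assumed names, not a Telegram
    or Telethon convention.
    """
    try:
        raw_id = env["TELEGRAM_API_ID"]
        api_hash = env["TELEGRAM_API_HASH"]
    except KeyError as exc:
        raise RuntimeError(f"missing environment variable: {exc.args[0]}") from exc
    if not raw_id.isdigit():
        raise RuntimeError("TELEGRAM_API_ID must be a number")
    if not api_hash:
        raise RuntimeError("TELEGRAM_API_HASH is empty")
    return int(raw_id), api_hash


# Assumed usage in the script, replacing the hard-coded values:
# telegram_api_id, telegram_api_hash = load_telegram_credentials()
```

This way the notebook can be shared without the keys ending up in it.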

On the first run you will need to follow these steps. When you run the code, Google Colab will first ask "Allow this notebook to access your Google credentials?" Accept.

Choose the Google account the runtime should use; after that you will need to configure Telegram.

Once the verification is done, you will receive a message with the login code.

Enter the code you received and you will be notified of the login.
The scraping will then start automatically.

A file should then appear in Google Sheets.

Done!

Kindy

Samueleex, well done. It is very well made and simple to use.
I see others building scraping and adding bots to scrape (steal) the members of other groups and add them to their own. One could also build a userbot that does the adding, or both.
