Email analyzer in Python

Samueleex

Email Analyzer

Email Analyzer is a Python tool for analyzing emails stored locally. It can process large batches of emails, detect spam, remove duplicates, extract keywords, and generate reports.

Requirements

Python 3 (the script uses f-strings, so version 3.6 or later)
Email files in one of these formats: .txt, .eml, .html, .htm

Dependencies

(Optional)

pip install beautifulsoup4 tqdm textstat pyyaml

beautifulsoup4: parsing of HTML emails
tqdm: progress bar while loading
textstat: readability analysis of the text
pyyaml: YAML configuration files

Usage examples

# Basic usage
python emailanalyzer.py /path/to/email/folder

# With more options
python emailanalyzer.py /path/to/email/folder -o results -f both -k "fattura" "pagamento" --save-summary

Syntax

python emailanalyzer.py <folder> [options]

The only required argument is the folder that contains your email files.

Options

Output:

-o, --output: output file name, without extension (default: email_analysis)
-f, --format: output format:
- csv
- json
- both: both formats (default)

Configuration:

-c, --config: path to a configuration file (supports .yaml, .json, .ini)
-k, --keywords: list of keywords to search for (regular expressions are supported)
-s, --similarity: similarity threshold above which emails with similar subjects are treated as duplicates (0.0-1.0, default: 0.8)

Behavior:

--no-recursive: do not search for emails in subfolders
--debug: debug mode
--test: run the automated tests
--save-summary: save the final summary to a .txt file

Examples of Use

Basic Examples

Simple analysis:

python emailanalyzer.py /home/user/emails

CSV format only:

python emailanalyzer.py /home/user/emails -f csv

Custom output file name:

python emailanalyzer.py /home/user/emails -o analysis

A more complex example

python emailanalyzer.py /emails \
  -o full_analysis \
  -f both \
  -k "fattura" "ricevuta" "\\bordine\\b" \
  -s 0.9 \
  --save-summary \
  --debug

Top-level folder only (no subfolders):

python emailanalyzer.py ./emails --no-recursive

With a configuration file:

python emailanalyzer.py /emails -c config.yaml -o config_results

Configuration

Supported Configuration Formats

The program supports three configuration formats:

YAML (recommended)

# config.yaml
keywords:
  - "\\bfattura\\b"
  - "\\bpagamento\\b"
  - "\\bordine\\b"
  - "\\bconferma\\b"
  - "\\bricevuta\\b"
  - "\\bbolletta\\b"

similarity_threshold: 0.8
spam_threshold: 3
recursive: true
output_format: "both"
debug: false

spam_indicators:
  - "noreply"
  - "no-reply"
  - "newsletter"
  - "marketing"
  - "unsubscribe"
  - "offerta"
  - "gratis"
  - "vincere"
  - "promozione"
  - "sconto"
  - "limited time"
  - "act now"
  - "click here"
  - "urgent"
  - "winner"
  - "congratulations"

fallback_patterns:
  from: "(?:from|da|mittente):\\s*(.+)"
  subject: "(?:subject|oggetto|subj):\\s*(.+)"
  date: "(?:date|data):\\s*(.+)"
  to: "(?:to|a|destinatario):\\s*(.+)"

JSON

{
  "keywords": [
    "\\bfattura\\b",
    "\\bpagamento\\b",
    "\\bordine\\b"
  ],
  "similarity_threshold": 0.8,
  "spam_threshold": 3,
  "recursive": true,
  "spam_indicators": [
    "noreply",
    "newsletter",
    "offerta"
  ]
}

INI

[DEFAULT]
keywords = fattura,pagamento,ordine
similarity_threshold = 0.8
spam_threshold = 3
recursive = true
debug = false
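
INI files store every value as text, so a list, a number, or a boolean has to be converted after reading. The sketch below shows one way to do that with the standard configparser module, using the sample file above. The helper name load_ini_settings is an assumption made for this illustration and is not part of the tool.

```python
import configparser

# Sample INI text mirroring the example above.
INI_TEXT = """
[DEFAULT]
keywords = fattura,pagamento,ordine
similarity_threshold = 0.8
spam_threshold = 3
recursive = true
debug = false
"""

def load_ini_settings(text):
    """Parse INI text and convert each value to a suitable Python type."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["DEFAULT"]
    return {
        # A comma-separated string becomes a list of keywords
        "keywords": [k.strip() for k in section["keywords"].split(",")],
        "similarity_threshold": section.getfloat("similarity_threshold"),
        "spam_threshold": section.getint("spam_threshold"),
        # getboolean understands true/false, yes/no, on/off, 1/0
        "recursive": section.getboolean("recursive"),
        "debug": section.getboolean("debug"),
    }

settings = load_ini_settings(INI_TEXT)
print(settings)
```

Without this conversion the keywords would stay a single string and the thresholds could not be compared with numeric scores.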

Configuration parameters

Search and filters:

keywords: keywords or regular expressions to search for
similarity_threshold: subject similarity above which two emails count as duplicates (0.0-1.0, default: 0.8)
spam_threshold: score at or above which an email is considered spam (default: 3)
spam_indicators: words or phrases that indicate spam

Behavior:

recursive: search subfolders (default: true)
output_format: "csv", "json", or "both"
debug: debug mode (default: false)

Advanced patterns

fallback_patterns: regular expressions used to parse an email when the standard parser raises an error.
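
To make the fallback concrete, here is a minimal sketch of how such patterns can pull header values out of free-form text. The patterns are copied from the configuration shown earlier. The function name extract_field and the sample text are assumptions made for this example.

```python
import re

# Patterns copied from the fallback_patterns configuration.
FALLBACK_PATTERNS = {
    "from": r"(?:from|da|mittente):\s*(.+)",
    "subject": r"(?:subject|oggetto|subj):\s*(.+)",
    "date": r"(?:date|data):\s*(.+)",
}

def extract_field(content, field, default="Sconosciuto"):
    """Return the first value matched for the field, or the default."""
    match = re.search(FALLBACK_PATTERNS[field], content, re.IGNORECASE | re.MULTILINE)
    return match.group(1).strip() if match else default

# Hypothetical text file with Italian header labels.
TEXT = "Mittente: anna@example.com\nOggetto: Conferma ordine\nData: 2025-01-06\n"

print(extract_field(TEXT, "from"))
print(extract_field(TEXT, "subject"))
print(extract_field(TEXT, "date"))
```

The patterns are not anchored to the start of a line, so a label that merely ends in "da:" (for example "agenda:") would also match. Adding ^ in front of a pattern is one possible way to tighten it.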

Main Features

1. Reading emails in different formats

The supported formats are:

.eml (Python's built-in email parser)
.txt (email parser + regex fallback)
.html, .htm (BeautifulSoup + metadata extraction)
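
As a rough illustration of the .eml path, the sketch below parses a raw message with Python's standard email package. The sample message and the helper name parse_basic_headers are made up for this example. The tool's own parser does more, such as decoding multipart bodies and handling fallbacks.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical sample message, used only for this illustration.
RAW_EMAIL = (
    "From: Mario Rossi <mario@example.com>\n"
    "Subject: Fattura n. 42\n"
    "Date: Mon, 06 Jan 2025 10:30:00 +0100\n"
    "\n"
    "In allegato la fattura.\n"
)

def parse_basic_headers(raw):
    """Return sender address, subject, date and body of a raw message."""
    msg = message_from_string(raw)
    # parseaddr splits "Name <address>" into its two parts
    _, address = parseaddr(msg.get("From", ""))
    return {
        "sender": address,
        "subject": msg.get("Subject", ""),
        "date": parsedate_to_datetime(msg["Date"]),
        "body": msg.get_payload(),
    }

info = parse_basic_headers(RAW_EMAIL)
print(info["sender"], "|", info["subject"], "|", info["date"].isoformat())
```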

2. What data is extracted?

Basic information:

filename
sender: the sender's address
subject
date
body (first 1000 characters)

Advanced analysis:

urls: the URLs found
url_count: the number of URLs
keywords_found: the keywords that matched
content_hash: MD5 hash of subject and body (used to detect duplicates)
is_spam: flag indicating whether the email is spam
readability_score: Flesch reading ease (only if textstat is available, otherwise 0)

3. Spam detection

Each indicator adds to the spam score:

Spam words or phrases from the configured list (+1 for each one found)
No-reply senders (+2)
Excessive punctuation, i.e. two or more consecutive exclamation marks (+1)
Too many capitals, i.e. more than 3 runs of three or more uppercase letters (+1)

Spam threshold: configurable, 3 points by default. An email is flagged as spam when its score reaches the threshold.
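
The scoring can be sketched in a few lines. The version below follows the checks in detect_spam_advanced from the code listing, but it is a simplified standalone function: the name spam_score and the shortened indicator list are assumptions made for this example.

```python
import re

# Shortened indicator list, for illustration only.
SPAM_INDICATORS = ["noreply", "gratis", "click here", "urgent"]

def spam_score(sender, subject, body, indicators=SPAM_INDICATORS):
    """Add up the points given by each spam indicator."""
    text = f"{sender} {subject} {body}".lower()
    # +1 for every listed word or phrase found anywhere in the email
    score = sum(1 for word in indicators if word in text)
    # +1 for a run of two or more exclamation marks
    if re.search(r"!{2,}", text):
        score += 1
    # +1 when more than 3 runs of three or more capitals appear
    if len(re.findall(r"[A-Z]{3,}", subject + body)) > 3:
        score += 1
    # +2 for no-reply senders
    if "noreply" in sender.lower() or "no-reply" in sender.lower():
        score += 2
    return score

spam = spam_score("noreply@shop.example", "URGENT!!", "GRATIS, click here")
clean = spam_score("anna@example.com", "Riunione", "Ci vediamo domani")
print(spam >= 3, clean >= 3)
```

A no-reply sender scores at least 3 on its own, because "noreply" is both a listed indicator (+1) and a sender check (+2). With the default threshold such emails are therefore always flagged.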

4. Duplicate removal

The tool cleans up the email list in three steps:

It discards the emails flagged as spam.
It removes emails with identical content, detected through their MD5 hash.
It removes emails whose subjects are too similar (the similarity threshold is configurable).
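
The three steps can be sketched as follows, using hashlib and difflib from the standard library as the code listing does. The function name clean_up and the sample emails are assumptions made for this example.

```python
import hashlib
from difflib import SequenceMatcher

def clean_up(emails, similarity_threshold=0.8):
    """Drop spam, exact duplicates, and emails with near-identical subjects."""
    # Step 1: drop everything already flagged as spam.
    non_spam = [m for m in emails if not m["is_spam"]]

    # Step 2: keep only the first email for each MD5 hash of subject + body.
    seen, unique = set(), []
    for mail in non_spam:
        digest = hashlib.md5((mail["subject"] + mail["body"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(mail)

    # Step 3: drop emails whose subject is too close to one already kept.
    kept = []
    for mail in unique:
        subject = mail["subject"].lower()
        if not any(
            SequenceMatcher(None, subject, other["subject"].lower()).ratio() > similarity_threshold
            for other in kept
        ):
            kept.append(mail)
    return kept

# Hypothetical input: one exact duplicate, one similar subject, one spam email.
emails = [
    {"subject": "Fattura gennaio", "body": "Totale 100 euro", "is_spam": False},
    {"subject": "Fattura gennaio", "body": "Totale 100 euro", "is_spam": False},
    {"subject": "Fattura gennaio!", "body": "Promemoria", "is_spam": False},
    {"subject": "Vinci subito", "body": "Clicca qui", "is_spam": True},
    {"subject": "Riunione di lunedì", "body": "Ordine del giorno", "is_spam": False},
]
result = clean_up(emails)
print([m["subject"] for m in result])
```

Step 3 compares subjects only, so two different emails with nearly the same subject are reduced to one. Raising the threshold towards 1.0 makes this step less aggressive.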

5. Keyword search

You can:

Search for plain text (case-insensitive)
Search with regular expressions (e.g. \bfattura\b to match whole words only)
Set the keywords in the configuration file or on the command line.
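
The sketch below shows how plain keywords and regular expressions can share one search routine: each keyword is compiled as a regex, and anything that is not a valid regex is escaped and searched literally. This mirrors the approach in the code listing, but the function names compile_keywords and find_keywords are assumptions made for this example.

```python
import re

def compile_keywords(keywords):
    """Compile each keyword as a case-insensitive regex, escaping invalid ones."""
    patterns = []
    for kw in keywords:
        try:
            patterns.append((kw, re.compile(kw, re.IGNORECASE)))
        except re.error:
            # Not a valid regex: treat the keyword as literal text
            patterns.append((kw, re.compile(re.escape(kw), re.IGNORECASE)))
    return patterns

def find_keywords(text, patterns):
    """Return the original keywords whose pattern occurs in the text."""
    return [kw for kw, pattern in patterns if pattern.search(text)]

# "ordine(" is deliberately an invalid regex to exercise the fallback.
patterns = compile_keywords(["pagamento", r"\bfattura\b", "ordine("])

print(find_keywords("La FATTURA e il Pagamento", patterns))
print(find_keywords("Processo di fatturazione", patterns))
print(find_keywords("ordine(123)", patterns))
```

The word boundaries in \bfattura\b are what keep "fatturazione" from matching. A plain keyword such as "fattura" would match it.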

6. URL extraction

HTTP/HTTPS URLs from the text
Links from HTML <a href="..."> tags
Images from HTML <img src="..."> tags
Removal of duplicate URLs
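
Both kinds of extraction can be sketched with the standard library alone. Note the substitutions: this example uses html.parser instead of BeautifulSoup, and a simplified URL pattern rather than the one in the tool. The names LinkCollector, urls_from_html and urls_from_text are assumptions made for this illustration.

```python
import re
from html.parser import HTMLParser

# Simplified pattern (an assumption): a URL runs until whitespace, a quote or an angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

class LinkCollector(HTMLParser):
    """Collect http(s) URLs from <a href> and <img src> attributes."""

    def __init__(self):
        super().__init__()
        self.urls = set()  # a set removes duplicates

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value and value.startswith(("http://", "https://")):
                self.urls.add(value)

def urls_from_html(markup):
    collector = LinkCollector()
    collector.feed(markup)
    return sorted(collector.urls)

def urls_from_text(text):
    return sorted(set(URL_RE.findall(text)))

HTML = (
    '<a href="https://example.com/a">x</a>'
    '<img src="https://example.com/logo.png">'
    '<a href="https://example.com/a">again</a>'
    '<a href="mailto:x@example.com">mail</a>'
)
html_urls = urls_from_html(HTML)
text_urls = urls_from_text("Visita https://example.com e http://test.org per info, poi https://example.com")
print(html_urls, text_urls)
```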

Output / reports

CSV file

Filename
Mittente (sender)
Oggetto (subject)
Data (date, YYYY-MM-DD HH:MM:SS)
URL_Count
Lista_URL (URLs separated by " | ")
Parole_Chiave (keywords found, comma-separated)
Anteprima_Corpo (first 100 characters of the body)
Spam_Score ("Spam" or "Pulita", i.e. clean)
Readability (Flesch reading ease, typically 0-100)

JSON file

The JSON file contains an array of objects with the complete data:

[
  {
    "filename": "",
    "sender": "",
    "subject": "",
    "date": "",
    "body": "",
    "urls": [""],
    "url_count": 1,
    "keywords_found": ["conferma", "ordine"],
    "content_hash": "",
    "is_spam": false,
    "readability_score": ...
  }
]
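
Because the JSON export is a plain array of objects, it is easy to post-process. The sketch below counts senders and keywords in a small report. The two records are made up for this example and only include some of the fields listed above.

```python
import json
from collections import Counter

# Made-up report with two records, shaped like the export described above.
REPORT = """
[
  {"filename": "a.eml", "sender": "anna@example.com", "subject": "Conferma ordine",
   "url_count": 1, "keywords_found": ["conferma", "ordine"], "is_spam": false},
  {"filename": "b.eml", "sender": "anna@example.com", "subject": "Fattura",
   "url_count": 0, "keywords_found": ["fattura"], "is_spam": false}
]
"""

records = json.loads(REPORT)

# How many emails each sender wrote
by_sender = Counter(r["sender"] for r in records)
# How often each keyword was found across all emails
keyword_counts = Counter(kw for r in records for kw in r["keywords_found"])
# Which files contain at least one URL
with_urls = [r["filename"] for r in records if r["url_count"] > 0]

print(by_sender.most_common(1), dict(keyword_counts), with_urls)
```

For a real report, json.load on the opened file replaces json.loads on the string.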

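Debug Mode

With --debug enabled, the program prints:

more detailed error messages
parsing information
processing statistics
pattern information

Testing

python emailanalyzer.py <folder> --test

The folder argument is still required by the argument parser, even though the tests do not read it.

Code: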

#!/usr/bin/env python3

import os
import re
import json
import csv
import hashlib
import argparse
import html
import unittest
import configparser
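from datetime import datetime
from email import message_from_string
from email.utils import parsedate_to_datetime, parseaddr
from pathlib import Path
import difflib

# PyYAML is optional: it is only needed to read .yaml configuration files
try:
    import yaml
except ImportError:
    yaml = None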

try:
    from bs4 import BeautifulSoup
    HAS_BS4 = True
except ImportError:
    HAS_BS4 = False
    print("⚠️  BeautifulSoup non disponibile. Installare con: pip install beautifulsoup4")

try:
    from tqdm import tqdm
    HAS_TQDM = True
except ImportError:
    HAS_TQDM = False
    print("⚠️  tqdm non disponibile. Installare con: pip install tqdm")

try:
    import textstat
    HAS_TEXTSTAT = True
except ImportError:
    HAS_TEXTSTAT = False

class EmailAnalyzer:
    def __init__(self, folder_path, config=None):
        self.folder_path = Path(folder_path)
        self.config = config or self.load_default_config()
        self.emails = []
        self.processed_emails = []
        self.stats = {
            'total_files': 0,
            'loaded_emails': 0,
            'spam_removed': 0,
            'duplicates_removed': 0,
            'final_count': 0
        }
        
        self.keyword_patterns = []
        for kw in self.config['keywords']:
            try:
                pattern = re.compile(kw, re.IGNORECASE)
                self.keyword_patterns.append((kw, pattern))
            except re.error:
                pattern = re.compile(re.escape(kw), re.IGNORECASE)
                self.keyword_patterns.append((kw, pattern))
    
    def load_default_config(self):
        return {
            'keywords': [
                r'\bfattura\b', r'\bpagamento\b', r'\bordine\b', 
                r'\bconferma\b', r'\bricevuta\b', r'\bbolletta\b'
            ],
            'similarity_threshold': 0.8,
            'spam_threshold': 3,
            'recursive': True,
            'output_format': 'both',
            'debug': False,
            'spam_indicators': [
                'noreply', 'no-reply', 'newsletter', 'marketing',
                'unsubscribe', 'offerta', 'gratis', 'vincere',
                'promozione', 'sconto', 'limited time', 'act now',
                'click here', 'urgent', 'winner', 'congratulations'
            ],
            'fallback_patterns': {
                'from': r'(?:from|da|mittente):\s*(.+)',
                'subject': r'(?:subject|oggetto|subj):\s*(.+)',
                'date': r'(?:date|data):\s*(.+)',
                'to': r'(?:to|a|destinatario):\s*(.+)'
            }
        }
    
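    def load_config_file(self, config_path):
        """Load a .yaml/.yml, .ini or JSON config and merge it over the defaults."""
        config_path = Path(config_path)
        defaults = self.load_default_config()

        if not config_path.exists():
            print(f"⚠️  File di configurazione non trovato: {config_path}")
            return defaults

        try:
            suffix = config_path.suffix.lower()
            if suffix in ('.yaml', '.yml'):
                with open(config_path, 'r', encoding='utf-8') as f:
                    loaded = yaml.safe_load(f) or {}
            elif suffix == '.ini':
                parser = configparser.ConfigParser()
                parser.read(config_path, encoding='utf-8')
                section = parser['DEFAULT']
                loaded = dict(section)
                # INI values are always strings: convert them to the expected types
                for key in ('keywords', 'spam_indicators'):
                    if key in loaded:
                        loaded[key] = [v.strip() for v in loaded[key].split(',') if v.strip()]
                for key, convert in (('similarity_threshold', float), ('spam_threshold', int)):
                    if key in loaded:
                        loaded[key] = convert(loaded[key])
                for key in ('recursive', 'debug'):
                    if key in loaded:
                        loaded[key] = section.getboolean(key)
            else:
                with open(config_path, 'r', encoding='utf-8') as f:
                    loaded = json.load(f)
            # Keys missing from the file keep their default values
            return {**defaults, **loaded}
        except Exception as e:
            print(f"⚠️  Errore nel caricare configurazione: {e}")
            return defaults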
    
    def read_emails(self):
        print(f"📂 Lettura email da: {self.folder_path}")
        
        # "**/*.ext" also searches subfolders, "*.ext" only the top-level folder
        pattern = "**/*" if self.config.get('recursive', True) else "*"
        email_files = []

        for ext in ['txt', 'eml', 'html', 'htm']:
            email_files.extend(self.folder_path.glob(f"{pattern}.{ext}"))
        
        self.stats['total_files'] = len(email_files)
        
        if not email_files:
            print("❌ Nessun file email trovato (.txt, .eml, .html)")
            return
            
        print(f"📧 Trovati {len(email_files)} file email")
        
        iterator = tqdm(email_files, desc="Caricamento email") if HAS_TQDM else email_files
        
        for file_path in iterator:
            try:
                email_data = self.read_single_email(file_path)
                if email_data:
                    self.emails.append(email_data)
                    
            except Exception as e:
                if self.config.get('debug', False):
                    print(f"⚠️  Errore nel leggere {file_path.name}: {e}")
        
        self.stats['loaded_emails'] = len(self.emails)
        print(f"✅ Caricate {len(self.emails)} email")
    
    def read_single_email(self, file_path):
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            content = f.read()
        
        if file_path.suffix.lower() in ['.html', '.htm']:
            return self.parse_html_email(content, file_path.name)
        else:
            return self.parse_email(content, file_path.name)
    
    def parse_html_email(self, content, filename):
        if not HAS_BS4:
            return self.parse_email(content, filename)
        
        try:
            soup = BeautifulSoup(content, 'html.parser')
            
            sender = self.extract_meta_or_fallback(soup, 'from', content)
            subject = self.extract_meta_or_fallback(soup, 'subject', content)
            date_str = self.extract_meta_or_fallback(soup, 'date', content)
            
            email_date = self.parse_date(date_str)
            
            body = soup.get_text(separator=' ', strip=True)
            
            urls = self.extract_urls_from_html(soup)
            
            return self.create_email_data(filename, sender, subject, email_date, body, urls)
            
        except Exception as e:
            if self.config.get('debug', False):
                print(f"Errore parsing HTML {filename}: {e}")
            return self.parse_email(content, filename)
    
    def extract_meta_or_fallback(self, soup, field, content):
        meta_selectors = {
            'from': ['meta[name="from"]', 'meta[property="from"]'],
            'subject': ['meta[name="subject"]', 'meta[property="subject"]', 'title'],
            'date': ['meta[name="date"]', 'meta[property="date"]']
        }
        
        if field in meta_selectors:
            for selector in meta_selectors[field]:
                element = soup.select_one(selector)
                if element:
                    return element.get('content') or element.get_text()
        
        return self.extract_with_regex(content, field)
    
    def extract_urls_from_html(self, soup):
        urls = set()
        
        for link in soup.find_all('a', href=True):
            href = link['href']
            if href.startswith(('http://', 'https://')):
                urls.add(href)
        
        for img in soup.find_all('img', src=True):
            src = img['src']
            if src.startswith(('http://', 'https://')):
                urls.add(src)
        
        return list(urls)
    
    def parse_email(self, content, filename):
        try:
            msg = message_from_string(content)
            
            sender = self.clean_email_address(msg.get('From', ''))
            subject = self.decode_header(msg.get('Subject', ''))
            date_str = msg.get('Date', '')
            email_date = self.parse_date(date_str)
            body = self.extract_body(msg)
            
        except Exception:
            # Standard parsing failed: fall back to the configured regex patterns
            sender = self.extract_with_regex(content, 'from')
            subject = self.extract_with_regex(content, 'subject')
            date_str = self.extract_with_regex(content, 'date')
            email_date = self.parse_date(date_str)
            body = content
        
        urls = self.extract_urls(body)
        
        return self.create_email_data(filename, sender, subject, email_date, body, urls)
    
    def extract_with_regex(self, content, field):
        patterns = self.config.get('fallback_patterns', {})
        if field not in patterns:
            return 'Sconosciuto' if field != 'body' else content
        
        pattern = patterns[field]
        match = re.search(pattern, content, re.IGNORECASE | re.MULTILINE)
        
        if match:
            return match.group(1).strip()
        
        return 'Sconosciuto' if field != 'body' else content
    
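    def parse_date(self, date_str):
        if not date_str:
            return datetime.now()

        try:
            parsed = parsedate_to_datetime(date_str)
            # Convert timezone-aware dates to naive local time, so that they can be
            # compared with the naive datetimes produced by the fallbacks below
            if parsed.tzinfo is not None:
                parsed = parsed.astimezone().replace(tzinfo=None)
            return parsed
        except Exception:
            pass

        formats = [
            '%Y-%m-%d %H:%M:%S',
            '%Y-%m-%d',
            '%d/%m/%Y %H:%M:%S',
            '%d/%m/%Y',
            '%d-%m-%Y %H:%M:%S',
            '%d-%m-%Y'
        ]

        for fmt in formats:
            try:
                return datetime.strptime(date_str.strip(), fmt)
            except ValueError:
                continue

        return datetime.now()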
    
    def create_email_data(self, filename, sender, subject, email_date, body, urls):
        sender = html.escape(sender or 'Sconosciuto')
        subject = html.escape(subject or 'Nessun oggetto')
        body = html.escape(body[:1000])
        
        found_keywords = self.find_keywords_regex(subject + ' ' + body)
        
        content_hash = hashlib.md5((subject + body).encode('utf-8')).hexdigest()
        
        is_spam = self.detect_spam_advanced(sender, subject, body)
        
        return {
            'filename': filename,
            'sender': sender,
            'subject': subject,
            'date': email_date,
            'body': body,
            'urls': urls,
            'url_count': len(urls),
            'keywords_found': found_keywords,
            'content_hash': content_hash,
            'is_spam': is_spam,
            'readability_score': self.calculate_readability(body) if HAS_TEXTSTAT else 0
        }
    
    def find_keywords_regex(self, text):
        found = []
        
        for keyword_name, pattern in self.keyword_patterns:
            if pattern.search(text):
                found.append(keyword_name)
        
        return found
    
    def calculate_readability(self, text):
        if not HAS_TEXTSTAT or not text.strip():
            return 0
        
        try:
            return textstat.flesch_reading_ease(text)
        except:
            return 0
    
    def detect_spam_advanced(self, sender, subject, body):
        spam_indicators = self.config.get('spam_indicators', [])
        text_to_check = (sender + ' ' + subject + ' ' + body).lower()
        
        spam_score = 0
        
        for indicator in spam_indicators:
            if indicator in text_to_check:
                spam_score += 1
        
        if len(re.findall(r'[!]{2,}', text_to_check)) > 0:
            spam_score += 1
        
        if len(re.findall(r'[A-Z]{3,}', subject + body)) > 3:
            spam_score += 1
        
        if 'noreply' in sender.lower() or 'no-reply' in sender.lower():
            spam_score += 2
        
        threshold = self.config.get('spam_threshold', 3)
        return spam_score >= threshold
    
    def clean_email_address(self, email_str):
        if not email_str:
            return 'Sconosciuto'

        name, email = parseaddr(email_str)
        # No escaping here: create_email_data() escapes every field exactly once
        return email if email else email_str

    def decode_header(self, header):
        if not header:
            return ''

        from email.header import decode_header
        try:
            decoded = decode_header(header)
            # No escaping here either, to avoid double-escaped output such as &amp;amp;
            return ''.join([
                part.decode(encoding or 'utf-8') if isinstance(part, bytes) else str(part)
                for part, encoding in decoded
            ])
        except Exception:
            return str(header)
    
    def extract_body(self, msg):
        body = ""
        
        if msg.is_multipart():
            for part in msg.walk():
                content_type = part.get_content_type()
                if content_type in ["text/plain", "text/html"]:
                    try:
                        payload = part.get_payload(decode=True)
                        if payload:
                            decoded = payload.decode('utf-8', errors='ignore')
                            if content_type == "text/html" and HAS_BS4:
                                soup = BeautifulSoup(decoded, 'html.parser')
                                body += soup.get_text(separator=' ', strip=True)
                            else:
                                body += decoded
                    except:
                        body += str(part.get_payload())
        else:
            try:
                payload = msg.get_payload(decode=True)
                if payload:
                    body = payload.decode('utf-8', errors='ignore')
                else:
                    body = str(msg.get_payload())
            except:
                body = str(msg.get_payload())
        
        return body
    
    def extract_urls(self, text):
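        # Hyphens, tildes, percent and plus signs are allowed in path and query,
        # so that URLs such as https://example.com/my-page are not cut short
        url_pattern = r'https?://(?:[-\w.])+(?:[:\d]+)?(?:/(?:[-\w/_.~%+])*(?:\?(?:[-\w&=%.+])*)?(?:#(?:[-\w.])*)?)?'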
        urls = re.findall(url_pattern, text)
        return list(set(urls))
    
    def remove_duplicates_and_spam(self):
        print("🧹 Rimozione duplicati e spam...")
        
        non_spam = [email for email in self.emails if not email['is_spam']]
        self.stats['spam_removed'] = len(self.emails) - len(non_spam)
        
        seen_hashes = set()
        unique_emails = []
        
        for email in non_spam:
            if email['content_hash'] not in seen_hashes:
                seen_hashes.add(email['content_hash'])
                unique_emails.append(email)
        
        final_emails = []
        similarity_threshold = self.config.get('similarity_threshold', 0.8)
        
        iterator = tqdm(unique_emails, desc="Rimozione simili") if HAS_TQDM else unique_emails
        
        for email in iterator:
            is_similar = False
            for existing in final_emails:
                similarity = difflib.SequenceMatcher(
                    None, 
                    email['subject'].lower(), 
                    existing['subject'].lower()
                ).ratio()
                
                if similarity > similarity_threshold:
                    is_similar = True
                    break
            
            if not is_similar:
                final_emails.append(email)
        
        self.stats['duplicates_removed'] = len(unique_emails) - len(final_emails)
        self.stats['final_count'] = len(final_emails)
        
        print(f"🗑️  Rimossi: {self.stats['spam_removed']} spam, {self.stats['duplicates_removed']} duplicati/simili")
        print(f"✅ Email pulite: {self.stats['final_count']}")
        
        self.processed_emails = final_emails
    
    def export_csv(self, output_file):
        print(f"📊 Esportazione CSV: {output_file}")
        
        with open(output_file, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            
            writer.writerow([
                'Filename', 'Mittente', 'Oggetto', 'Data', 
                'URL_Count', 'Lista_URL', 'Parole_Chiave', 
                'Anteprima_Corpo', 'Spam_Score', 'Readability'
            ])
            
            for email in self.processed_emails:
                writer.writerow([
                    email['filename'],
                    email['sender'],
                    email['subject'],
                    email['date'].strftime('%Y-%m-%d %H:%M:%S'),
                    email['url_count'],
                    ' | '.join(email['urls']),
                    ', '.join(email['keywords_found']),
                    email['body'][:100] + '...' if len(email['body']) > 100 else email['body'],
                    'Spam' if email['is_spam'] else 'Pulita',
                    email.get('readability_score', 0)
                ])
    
    def export_json(self, output_file):
        print(f"📊 Esportazione JSON: {output_file}")
        
        json_data = []
        for email in self.processed_emails:
            json_email = email.copy()
            json_email['date'] = email['date'].isoformat()
            json_data.append(json_email)
        
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(json_data, f, indent=2, ensure_ascii=False)
    
    def generate_summary(self):
        summary = []
        summary.append("="*60)
        summary.append("📈 RIASSUNTO ANALISI EMAIL")
        summary.append("="*60)
        summary.append(f"🕐 Data analisi: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        summary.append(f"📂 Cartella analizzata: {self.folder_path}")
        summary.append("")
        
        summary.append("📊 STATISTICHE GENERALI")
        summary.append("-" * 30)
        summary.append(f"📧 File totali trovati: {self.stats['total_files']}")
        summary.append(f"✅ Email caricate: {self.stats['loaded_emails']}")
        summary.append(f"🗑️  Spam rimossi: {self.stats['spam_removed']}")
        summary.append(f"🔄 Duplicati rimossi: {self.stats['duplicates_removed']}")
        summary.append(f"🎯 Email finali: {self.stats['final_count']}")
        summary.append("")
        
        if not self.processed_emails:
            summary.append("❌ Nessuna email da analizzare")
            return "\n".join(summary)
        
        senders = {}
        for email in self.processed_emails:
            sender = email['sender']
            senders[sender] = senders.get(sender, 0) + 1
        
        summary.append("🔝 TOP 10 MITTENTI")
        summary.append("-" * 30)
        for sender, count in sorted(senders.items(), key=lambda x: x[1], reverse=True)[:10]:
            summary.append(f"{sender}: {count} email")
        summary.append("")
        
        all_keywords = []
        for email in self.processed_emails:
            all_keywords.extend(email['keywords_found'])
        
        if all_keywords:
            keyword_count = {}
            for kw in all_keywords:
                keyword_count[kw] = keyword_count.get(kw, 0) + 1
            
            summary.append("🔍 PAROLE CHIAVE TROVATE")
            summary.append("-" * 30)
            for kw, count in sorted(keyword_count.items(), key=lambda x: x[1], reverse=True):
                summary.append(f"'{kw}': {count} volte")
            summary.append("")
        
        emails_with_urls = sum(1 for email in self.processed_emails if email['urls'])
        total_urls = sum(len(email['urls']) for email in self.processed_emails)
        
        summary.append("🔗 STATISTICHE URL")
        summary.append("-" * 30)
        summary.append(f"Email con URL: {emails_with_urls}")
        summary.append(f"URL totali: {total_urls}")
        summary.append("")
        
        dates = [email['date'] for email in self.processed_emails]
        if dates:
            min_date = min(dates)
            max_date = max(dates)
            summary.append("📅 PERIODO TEMPORALE")
            summary.append("-" * 30)
            summary.append(f"Prima email: {min_date.strftime('%Y-%m-%d')}")
            summary.append(f"Ultima email: {max_date.strftime('%Y-%m-%d')}")
            summary.append("")
        
        return "\n".join(summary)
    
    def save_summary(self, output_file):
        summary = self.generate_summary()
        
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(summary)
        
        print(f"📄 Riassunto salvato: {output_file}")
        return summary
    
    def print_summary(self):
        summary = self.generate_summary()
        print(summary)

class TestEmailAnalyzer(unittest.TestCase):
    def setUp(self):
        self.analyzer = EmailAnalyzer('test_folder')
    
    def test_extract_urls(self):
        text = "Visita https://example.com e http://test.org per info"
        urls = self.analyzer.extract_urls(text)
        self.assertEqual(len(urls), 2)
        self.assertIn('https://example.com', urls)
        self.assertIn('http://test.org', urls)
    
    def test_detect_spam(self):
        spam_text = "GRATIS! Vinci ora! Click here for unlimited offers!"
        is_spam = self.analyzer.detect_spam_advanced('noreply@spam.com', 'URGENT!', spam_text)
        self.assertTrue(is_spam)
    
    def test_keyword_regex(self):
        self.analyzer.keyword_patterns = [('fattura', re.compile(r'\bfattura\b', re.IGNORECASE))]
        found = self.analyzer.find_keywords_regex('La tua fattura è pronta')
        self.assertIn('fattura', found)
        
        found = self.analyzer.find_keywords_regex('Il processo di fatturazione')
        self.assertEqual(len(found), 0)

def main():
    parser = argparse.ArgumentParser(description='Analizza email da una cartella (Versione Avanzata)')
    parser.add_argument('folder', help='Cartella contenente le email')
    parser.add_argument('-o', '--output', default='email_analysis', 
                       help='Nome file di output (senza estensione)')
    parser.add_argument('-f', '--format', choices=['csv', 'json', 'both'],
                       default=None, help='Formato di output (default: both)')
    parser.add_argument('-c', '--config', help='File di configurazione (.yaml, .json, .ini)')
    parser.add_argument('-k', '--keywords', nargs='+',
                       help='Parole chiave personalizzate (supporta regex)')
    # No argparse default here: None means "not passed", so the config file value survives
    parser.add_argument('-s', '--similarity', type=float, default=None,
                       help='Soglia di similarità per duplicati (0.0-1.0, default: 0.8)')
    parser.add_argument('--no-recursive', action='store_true',
                       help='Non cercare nelle sottocartelle')
    parser.add_argument('--debug', action='store_true',
                       help='Modalità debug')
    parser.add_argument('--test', action='store_true',
                       help='Esegui test automatici')
    parser.add_argument('--save-summary', action='store_true',
                       help='Salva riassunto in file .txt')

    args = parser.parse_args()

    if args.test:
        print("🧪 Esecuzione test automatici...")
        unittest.main(argv=[''], exit=False, verbosity=2)
        return

    if not os.path.exists(args.folder):
        print(f"❌ Cartella non trovata: {args.folder}")
        return

    print("🚀 Avvio Email Analyzer Advanced")
    print("-" * 40)

    # Start from the defaults, or from the configuration file when one is given
    bootstrap = EmailAnalyzer(args.folder)
    config = bootstrap.load_config_file(args.config) if args.config else bootstrap.config

    # Command-line options override the configuration only when explicitly passed
    if args.keywords:
        config['keywords'] = args.keywords
    if args.similarity is not None:
        config['similarity_threshold'] = args.similarity
    if args.no_recursive:
        config['recursive'] = False
    if args.debug:
        config['debug'] = True
    if args.format is None:
        args.format = config.get('output_format', 'both')

    analyzer = EmailAnalyzer(args.folder, config)
    
    analyzer.read_emails()
    
    if not analyzer.emails:
        print("❌ Nessuna email da processare")
        return
    
    analyzer.remove_duplicates_and_spam()
    
    if args.format in ['csv', 'both']:
        analyzer.export_csv(f"{args.output}.csv")
    
    if args.format in ['json', 'both']:
        analyzer.export_json(f"{args.output}.json")
    
    if args.save_summary:
        analyzer.save_summary(f"{args.output}_summary.txt")
    
    analyzer.print_summary()
    
    print(f"\n✅ Analisi completata! File salvati come: {args.output}.*")

if __name__ == "__main__":
    main()

Founders