# HTTPZ Web Scanner

![](./.screens/preview.gif)

A high-performance concurrent web scanner written in Python. HTTPZ scans domains for HTTP/HTTPS services and extracts status codes, titles, SSL certificates, and more.

## Requirements

- [Python](https://www.python.org/downloads/)
  - [aiohttp](https://pypi.org/project/aiohttp/)
  - [beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
  - [cryptography](https://pypi.org/project/cryptography/)
  - [dnspython](https://pypi.org/project/dnspython/)
  - [mmh3](https://pypi.org/project/mmh3/)
  - [python-dotenv](https://pypi.org/project/python-dotenv/)

## Installation

### Via pip *(recommended)*
```bash
# Install from PyPI
pip install httpz_scanner

# The 'httpz' command will now be available in your terminal
httpz --help
```

### From source
```bash
# Clone the repository
git clone https://github.com/acidvegas/httpz
cd httpz
pip install -r requirements.txt
```

## Usage

### Command Line Interface

Basic usage:
```bash
python -m httpz_scanner domains.txt
```

Scan with all flags enabled and output to JSONL:
```bash
python -m httpz_scanner domains.txt -all -c 100 -o results.jsonl -j -p
```
     49 
     50 Read from stdin:
     51 ```bash
     52 cat domains.txt | python -m httpz_scanner - -all -c 100
     53 echo "example.com" | python -m httpz_scanner - -all
     54 ```
     55 
     56 Filter by status codes and follow redirects:
     57 ```bash
     58 python -m httpz_scanner domains.txt -mc 200,301-399 -ec 404,500 -fr -p
     59 ```
     60 
     61 Show specific fields with custom timeout and resolvers:
     62 ```bash
     63 python -m httpz_scanner domains.txt -sc -ti -i -tls -to 10 -r resolvers.txt
     64 ```
     65 
     66 Full scan with all options:
     67 ```bash
     68 python -m httpz_scanner domains.txt -c 100 -o output.jsonl -j -all -to 10 -mc 200,301 -ec 404,500 -p -ax -r resolvers.txt
     69 ```

### Distributed Scanning
Split scanning across multiple machines using the `--shard` argument:

```bash
# Machine 1
httpz domains.txt --shard 1/3

# Machine 2
httpz domains.txt --shard 2/3

# Machine 3
httpz domains.txt --shard 3/3
```

Each machine processes a different subset of domains, with no overlap between shards. For example, with 3 shards:
- Machine 1 processes lines 0, 3, 6, 9, ...
- Machine 2 processes lines 1, 4, 7, 10, ...
- Machine 3 processes lines 2, 5, 8, 11, ...

This lets a large scan be divided evenly across multiple machines.
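
The line assignment above amounts to a round-robin split on the line index. Below is a minimal sketch of that rule, assuming the 1-based `N/T` numbering used on the command line; `parse_shard` and `shard_lines` are hypothetical helper names for illustration, not httpz functions.

```python
from itertools import islice

def parse_shard(spec):
    """Parse a string such as '2/3' into (shard_index, total_shards)."""
    index, total = (int(part) for part in spec.split('/', 1))
    if not 1 <= index <= total:
        raise ValueError('shard index must be between 1 and the total')
    return index, total

def shard_lines(lines, shard_index, total_shards):
    """Return an iterator over the lines that belong to one shard.

    Shard 1 starts at line 0, shard 2 at line 1, and so on;
    every shard then steps forward by total_shards lines.
    """
    return islice(lines, shard_index - 1, None, total_shards)
```

With this rule, `shard_lines(lines, *parse_shard('2/3'))` selects lines 1, 4, 7, 10, ... of the input, which matches the Machine 2 row above.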

### Python Library
```python
import asyncio
import urllib.request
from httpz_scanner import HTTPZScanner

async def scan_from_list() -> list:
    # Download a domain list and return the first 20 non-empty lines
    # (urllib.request is a blocking call)
    with urllib.request.urlopen('https://example.com/domains.txt') as response:
        content = response.read().decode()
        return [line.strip() for line in content.splitlines() if line.strip()][:20]

async def scan_from_url():
    # Async generator: stream domains from a URL one line at a time
    with urllib.request.urlopen('https://example.com/domains.txt') as response:
        for line in response:
            # Decode the raw bytes first, then skip empty lines
            if line := line.decode().strip():
                yield line

async def scan_from_file():
    # Async generator: yield non-empty lines from a local file
    with open('domains.txt', 'r') as file:
        for line in file:
            if line := line.strip():
                yield line

async def main():
    # Initialize scanner with all possible options (showing defaults)
    scanner = HTTPZScanner(
        concurrent_limit=100,   # Number of concurrent requests
        timeout=5,              # Request timeout in seconds
        follow_redirects=False, # Follow redirects (max 10)
        check_axfr=False,       # Try AXFR transfer against nameservers
        resolver_file=None,     # Path to custom DNS resolvers file
        output_file=None,       # Path to JSONL output file
        show_progress=False,    # Show progress counter
        debug_mode=False,       # Show error states and debug info
        jsonl_output=False,     # Output in JSONL format
        shard=None,             # Tuple of (shard_index, total_shards) for distributed scanning

        # Control which fields to show (all False by default unless show_fields is None)
        show_fields={
            'status_code': True,      # Show status code
            'content_type': True,     # Show content type
            'content_length': True,   # Show content length
            'title': True,            # Show page title
            'body': True,             # Show body preview
            'ip': True,               # Show IP addresses
            'favicon': True,          # Show favicon hash
            'headers': True,          # Show response headers
            'follow_redirects': True, # Show redirect chain
            'cname': True,            # Show CNAME records
            'tls': True               # Show TLS certificate info
        },

        # Filter results
        match_codes={200,301,302},  # Only show these status codes
        exclude_codes={404,500,503} # Exclude these status codes
    )

    # Example 1: Process file
    print('\nProcessing file:')
    async for result in scanner.scan(scan_from_file()):
        print(f"{result['domain']}: {result['status']}")

    # Example 2: Stream URLs
    print('\nStreaming URLs:')
    async for result in scanner.scan(scan_from_url()):
        print(f"{result['domain']}: {result['status']}")

    # Example 3: Process list
    print('\nProcessing list:')
    domains = await scan_from_list()
    async for result in scanner.scan(domains):
        print(f"{result['domain']}: {result['status']}")

if __name__ == '__main__':
    asyncio.run(main())
```

The scanner accepts several input types:
- File paths (string)
- Lists/tuples of domains
- stdin (using `'-'`)
- Async generators that yield domains

All inputs support sharding for distributed scanning using the `shard` parameter.
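
One way to picture how these input types can be treated uniformly is to normalize each of them into a single async stream of domains. The sketch below is illustrative only: `iter_domains` and `collect` are hypothetical names, and httpz's own input handling may differ.

```python
import asyncio
import sys

async def iter_domains(source):
    """Yield stripped, non-empty domains from any of the input types above."""
    if hasattr(source, '__aiter__'):
        # Async generator (or any other async iterable) of domains
        async for item in source:
            if item := item.strip():
                yield item
        return
    if source == '-':
        lines = sys.stdin                 # read domains from stdin
    elif isinstance(source, str):
        # Treat any other string as a file path (blocking read, fine for a sketch)
        with open(source, 'r') as handle:
            lines = handle.readlines()
    else:
        lines = source                    # list or tuple of domains
    for item in lines:
        if item := item.strip():
            yield item

async def collect(source):
    """Gather every domain from a source into a list."""
    return [domain async for domain in iter_domains(source)]
```

For example, `asyncio.run(collect(['example.com', ' example.org\n']))` returns the two domains with surrounding whitespace removed.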

## Arguments

| Argument      | Long Form        | Description                                                 |
|---------------|------------------|-------------------------------------------------------------|
| `file`        |                  | File containing domains *(one per line)*, use `-` for stdin |
| `-d`          | `--debug`        | Show error states and debug information                     |
| `-c N`        | `--concurrent N` | Number of concurrent checks *(default: 100)*                |
| `-o FILE`     | `--output FILE`  | Output file path *(JSONL format)*                           |
| `-j`          | `--jsonl`        | Output JSON Lines format to console                         |
| `-all`        | `--all-flags`    | Enable all output flags                                     |
| `-sh N/T`     | `--shard N/T`    | Process shard N of T total shards *(e.g., 1/3)*             |

### Output Field Flags

| Flag   | Long Form            | Description                      |
|--------|----------------------|----------------------------------|
| `-sc`  | `--status-code`      | Show status code                 |
| `-ct`  | `--content-type`     | Show content type                |
| `-ti`  | `--title`            | Show page title                  |
| `-b`   | `--body`             | Show body preview                |
| `-i`   | `--ip`               | Show IP addresses                |
| `-f`   | `--favicon`          | Show favicon hash                |
| `-hr`  | `--headers`          | Show response headers            |
| `-cl`  | `--content-length`   | Show content length              |
| `-fr`  | `--follow-redirects` | Follow redirects *(max 10)*      |
| `-cn`  | `--cname`            | Show CNAME records               |
| `-tls` | `--tls-info`         | Show TLS certificate information |

### Other Options

| Option      | Long Form               | Description                                                             |
|-------------|-------------------------|-------------------------------------------------------------------------|
| `-to N`     | `--timeout N`           | Request timeout in seconds *(default: 5)*                               |
| `-mc CODES` | `--match-codes CODES`   | Only show specific status codes *(comma-separated, e.g. `200,301-399`)* |
| `-ec CODES` | `--exclude-codes CODES` | Exclude specific status codes *(comma-separated, e.g. `404,500`)*       |
| `-p`        | `--progress`            | Show progress counter                                                   |
| `-ax`       | `--axfr`                | Try AXFR transfer against nameservers                                   |
| `-r FILE`   | `--resolvers FILE`      | File containing DNS resolvers *(one per line)*                          |
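
To illustrate how values passed to `-mc` and `-ec` might be interpreted, the sketch below expands a spec such as `200,301-399` into a set of codes and then applies match/exclude filtering. The names `parse_status_codes` and `is_shown` are hypothetical, and the way the two filters combine here (a result must pass the match filter and must not hit the exclude filter) is an assumption, not confirmed httpz behavior.

```python
def parse_status_codes(spec):
    """Expand a spec such as '200,301-399' into a set of integer status codes."""
    codes = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            # Inclusive range, e.g. '301-399'
            start, end = (int(value) for value in part.split('-', 1))
            codes.update(range(start, end + 1))
        else:
            codes.add(int(part))
    return codes

def is_shown(status, match_codes=None, exclude_codes=None):
    """Return True if a status code passes both filters (assumed semantics)."""
    if match_codes and status not in match_codes:
        return False
    if exclude_codes and status in exclude_codes:
        return False
    return True
```

For instance, `is_shown(302, parse_status_codes('200,301-399'), parse_status_codes('404,500'))` evaluates to `True` under these assumptions.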