```
├── LICENSE
├── README.md
├── requirements.txt
├── scopify.py
```

## /LICENSE

``` path="/LICENSE"
MIT License

Copyright (c) 2024

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

## /README.md

# Scopify - Netify.ai Reconnaissance Tool

[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

**Scopify** is a Python command-line tool designed for penetration testers and bug bounty hunters to quickly gather and analyze infrastructure information (CDN, Hosting, SaaS) for a target company by scraping `netify.ai`.

Developed by [@jhaddix](https://x.com/Jhaddix) and [Arcanum Information Security](https://www.arcanum-sec.com/).

It optionally leverages OpenAI's API to provide AI-driven analysis of the gathered infrastructure, highlighting potential areas of interest and suggesting reconnaissance methodologies.

## Setup

1. **Clone the repository (if applicable) or ensure you have the files:**
   * `scopify.py`
   * `requirements.txt`

2. **Create a Python virtual environment:**
   ```bash
   python3 -m venv venv
   ```

3. **Activate the virtual environment:**
   * On Linux/macOS:
     ```bash
     source venv/bin/activate
     ```
   * On Windows:
     ```bash
     .\venv\Scripts\activate
     ```

4. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

## Usage

Run the script from the command line using the `-c` or `--company` flag followed by the company name (lowercase, use hyphens if needed based on `netify.ai`'s URL structure).

```bash
python scopify.py -c <company_name> [--analyze]
```

**Arguments:**

* `-c`, `--company`: (Required) The target company name.
* `--analyze`: (Optional) Enables AI analysis of the scraped data using OpenAI. Requires the `OPENAI_API_KEY` environment variable to be set.

**Environment Variable for AI Analysis:**

To use the `--analyze` feature, you **must** set your OpenAI API key as an environment variable named `OPENAI_API_KEY` before running the script.

* On Linux/macOS:
  ```bash
  export OPENAI_API_KEY='your-api-key-here'
  ```
* On Windows (Command Prompt):
  ```bash
  set OPENAI_API_KEY=your-api-key-here
  ```
* On Windows (PowerShell):
  ```bash
  $env:OPENAI_API_KEY = 'your-api-key-here'
  ```

Replace `'your-api-key-here'` with your actual OpenAI API key.
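**Tip: checking the company slug.** `scopify.py` builds the target URL as `https://www.netify.ai/resources/applications/<company>` (lowercased), so the value passed to `-c` must match netify.ai's slug. The snippet below is an optional sketch, not part of the tool; it mirrors the URL construction in `scopify.py`, and the `check_slug` helper name is illustrative.

```python
# check_slug.py - optional helper sketch, not part of the scopify repo
import requests


def netify_url(company: str) -> str:
    # Mirrors the URL construction used in scopify.py's scrape_netify().
    return f"https://www.netify.ai/resources/applications/{company.lower()}"


def check_slug(company: str) -> bool:
    # A plain GET (as the tool itself performs) returns 200 for a known slug, 404 otherwise.
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(netify_url(company), headers=headers, timeout=15)
    return resp.status_code == 200


if __name__ == "__main__":
    for slug in ("walmart", "no-such-company-xyz"):
        print(slug, "->", "found" if check_slug(slug) else "not found")
```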
**Example (Basic):**

```bash
python scopify.py -c walmart
```

```bash
--- CDNs ---
CDNs                           # of IPs
---------------------------------------------
Akamai                         10382
Amazon CloudFront              3076
Fastly                         331
Cloudflare                     16
Azure Front Door               5


--- Hosting ---
Cloud Hosts                    # of IPs
---------------------------------------------
Google Hosted                  245
Amazon AWS                     28
Unitas Global                  17
Microsoft Azure                15
Equinix                        3
Cloudinary                     1
Google Cloud Platform          1
Rackspace                      1
WP Engine                      1


--- SaaS ---
SaaS                           # of IPs
---------------------------------------------
Email Studio                   221
Adobe EM                       168
Adobe Ads                      59
SendGrid                       16
Salesforce                     6
Validity                       5
LexisNexis Risk                3
Mailgun                        2
Pendo                          2
MarkMonitor                    1
Medallia                       1
```

**Example (With AI Analysis):**

```bash
python scopify.py -c walmart --analyze
```

```bash
--- AI Analysis ---
Analyzing infrastructure data with OpenAI...

1. CDN OBSERVATIONS

- Akamai (10 382 IPs)
  • Global edge network with robust WAF capabilities (Kona, GTM, Bot Manager).
  • Look for subdomain–origin mismatches (staging/test instances) via wildcard DNS or certificate transparency logs.
  • Test Host header and path-based routing bypasses to reach internal origins.

- Amazon CloudFront (3 076 IPs)
  • Common misconfiguration: S3 bucket origin left public or locked behind custom domain.
  • Probe for Host header overrides and unvalidated redirects.
  • Enumerate unused edge configurations via custom CNAMEs in DNS records.

- Fastly (331 IPs)
  • Default VCL snippets can leak backend hostnames.
  • Potential open proxy behavior if VCL not locked down.
  • Fingerprint backend exposures by sending unusual HTTP verbs and headers.

- Cloudflare (16 IPs)
  • WAF and rate-limiting active, but origin IPs often exposed via DNS history or archived scan data.
  • Check for subdomain takeover on unclaimed DNS entries (e.g., *.walmart.com pointing to Cloudflare but unregistered in Cloudflare dashboard).

- Azure Front Door (5 IPs)
  • Host rewrite misconfigurations may allow Host header bypass.
  • Verify custom domain validation to prevent unwanted CNAME mapping.

2. HOSTING OBSERVATIONS

- Google Hosted (245 IPs)
  • High volume suggests static asset or microservice hosting.
  • GCP metadata service attacks if misconfigured; check for exposed metadata endpoints via SSRF.

- Amazon AWS (28 IPs)
  • Potential IAM role exposure in EC2 metadata; test for SSRF.
  • Publicly exposed services (e.g., ELB, API Gateway) could reveal unused endpoints.

- Unitas Global (17 IPs) & Equinix (3 IPs)
  • Likely colocation/shared transit.
  • Scan for open management interfaces (SSH, RDP) and default credentials.

- Microsoft Azure (15 IPs)
  • Similar SSRF/metadata considerations.
  • Check for Azure-specific services (App Service, Function Apps) with default subdomains.

- Single-IP Hosts (Cloudinary, GCP, Rackspace, WP Engine)
  • Specialized services; enumerate hostnames to uncover asset footprint or CMS backends.

3. SAAS OBSERVATIONS

- Email Studio (221 IPs), Adobe EM (168 IPs), Adobe Ads (59 IPs)
  • Marketing automation platforms.
  • Inspect tracking pixels, CORS policies, and parameter injection in campaign URLs.

- SendGrid (16 IPs), Mailgun (2 IPs)
  • Email delivery APIs.
  • Test URL callbacks, webhook endpoints, and API key exposure in front-end code.

- Salesforce (6 IPs)
  • CRM integration points; possible OAuth endpoints.
  • Look for custom subdomains (e.g., mycompany.salesforce.com) ripe for subdomain takeover or exposed metadata.

- Validity (5 IPs), LexisNexis Risk (3 IPs)
  • Data quality/risk scoring.
  • Assess JavaScript SDK integrations for unsafe POST requests or leakage of PII.
- Pendo (2 IPs), Medallia (1 IP), MarkMonitor (1 IP)
  • In-app analytics and feedback widgets.
  • Scrutinize embedded scripts for client-side logic flaws (XSS, insecure storage).

4. METHODOLOGY

- Subdomain Enumeration
  • Use modern tooling to exhaust DNS, certificate transparency, and brute lists.

- WAF Fingerprinting & Bypass Testing
  • Send crafted payloads and monitor differences in response codes/headers to distinguish between CDNs.

- Origin Exposure Testing
  • Override DNS resolution locally to connect directly to edge or origin IPs and bypass CDN protections.

- Cloud Storage Enumeration
  • Query GrayHatWarfare for public bucket listings: https://buckets.grayhatwarfare.com/files?keywords=walmart

- SaaS Integration Review
  • Crawl front-end code for third-party SDKs, inspect endpoints, test CORS and authentication flows.

- Metadata & SSRF Checks
  • Target AWS/Azure/GCP metadata URLs via any SSRF-susceptible parameter.

- Service Scan & Port Verification
  • Validate open ports and banner grabs on hosting IP ranges to identify exposed management interfaces.

All findings should guide focused audit scopes and safe proof-of-concepts in line with Walmart's bug bounty policy.
```

**Output:**

The script will output sorted tables for CDNs, Hosting providers, and SaaS platforms used by the specified company.

If `--analyze` is used and the API key is set correctly, an AI-generated summary and analysis relevant to penetration testing/bug bounty hunting will be printed after the tables.

If the company page is not found (404 error), an error message will be displayed suggesting you check the company name spelling and format.

A `debug_<company_name>.html` file is also created/overwritten in the same directory, containing the raw HTML source of the scraped page (or the error page) for debugging purposes.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues to suggest features or report bugs.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Disclaimer

This tool is intended for educational and authorized security testing purposes only. Use it responsibly and ethically. The developers assume no liability and are not responsible for any misuse or damage caused by this tool. Always ensure you have explicit permission before scanning any target.
## Acknowledgements

* Data sourced from [Netify.ai](https://www.netify.ai/)
* AI Analysis powered by [OpenAI](https://openai.com/)

## /requirements.txt

``` path="/requirements.txt"
requests
beautifulsoup4
openai
```

## /scopify.py

```py path="/scopify.py"
import argparse
import requests
from bs4 import BeautifulSoup
import re
import os
from openai import OpenAI


def analyze_with_openai(company_name, data):
    """Analyzes the scraped data using OpenAI API."""
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        print("\nError: OPENAI_API_KEY environment variable not set.")
        print(" Please set it to use the --analyze feature.")
        return None

    client = OpenAI(api_key=api_key)
    model = "o4-mini"

    # Format the data for the prompt
    prompt_data = f"Analysis Target: {company_name}\n\n"
    if data.get('cdn'):
        prompt_data += "--- CDNs ---\n"
        prompt_data += f"{'CDN Name':<30} {'# of IPs'}\n"
        prompt_data += "-"*45 + "\n"
        for name, count in data['cdn']:
            prompt_data += f"{name:<30} {count}\n"
        prompt_data += "\n"
    if data.get('hosting'):
        prompt_data += "--- Hosting ---\n"
        prompt_data += f"{'Cloud Host':<30} {'# of IPs'}\n"
        prompt_data += "-"*45 + "\n"
        for name, count in data['hosting']:
            prompt_data += f"{name:<30} {count}\n"
        prompt_data += "\n"
    if data.get('saas'):
        prompt_data += "--- SaaS ---\n"
        prompt_data += f"{'SaaS Platform':<30} {'# of IPs'}\n"
        prompt_data += "-"*45 + "\n"
        for name, count in data['saas']:
            prompt_data += f"{name:<30} {count}\n"
        prompt_data += "\n"

    system_prompt = (
        "You are a bug bounty hunter specializing in infrastructure reconnaissance. "
        "Your task is to analyze the provided CDN, Hosting, "
        "and SaaS information for a target company and provide a concise summary highlighting potential areas of interest "
        "and attack vectors based *solely* on this infrastructure data.\n\nFocus on:\n"
        "- The significance of specific CDNs (e.g., WAF capabilities, common misconfigurations).\n"
        "- Implications of the hosting providers (e.g., cloud security models, potential for exposed services on specific providers).\n"
        "- SaaS platforms that might indicate integration points, authentication mechanisms, or data storage locations relevant to security testing.\n"
        "- Consider the relative number of IPs associated with each service as a potential indicator of scale or importance.\n"
        "- Do not invent information not present in the tables. Provide actionable insights based *only* on the provided data.\n"
        "- Provide any verified methodology to bug bounty hunt on the analysis. When describing methodology, focus on the *type* of action (e.g., subdomain enumeration, WAF fingerprinting) and suggest using 'modern tools' or 'appropriate tooling' for the task, rather than naming specific software tools.\n"
        "- For the cloud section, attempt to build a GrayHatWarfare link for the user to click on to look at this target. Use the format: https://buckets.grayhatwarfare.com/files?keywords= (replace with the actual target company name).\n\n"
        "**Formatting Instructions:** Structure your response for easy readability on a standard terminal. Use clear headings for sections (e.g., '1. CDN OBSERVATIONS'). Use double newlines to separate major sections and single newlines for list items under headings. Ensure bullet points (-) are clearly indented."
    )

    print("\n--- AI Analysis --- ")
    print("Analyzing infrastructure data with OpenAI...\n")

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt_data}
            ]
        )
        analysis = response.choices[0].message.content
        return analysis.strip()
    except Exception as e:
        print(f"\nError during OpenAI API call: {e}")
        return None


def scrape_netify(company_name):
    """Scrapes netify.ai for CDN, Hosting, and SaaS information and returns it."""
    url = f"https://www.netify.ai/resources/applications/{company_name.lower()}"
    html_filename = f"debug_{company_name.lower()}.html"
    scraped_data = {'cdn': [], 'hosting': [], 'saas': []}

    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()  # Raises HTTPError for bad responses (4xx or 5xx)

        # Save HTML content for debugging
        try:
            with open(html_filename, 'w', encoding='utf-8') as f:
                f.write(response.text)
            # print(f"Saved HTML content to {html_filename}")  # Keep this commented
        except IOError as e:
            print(f"Error saving HTML to {html_filename}: {e}")

    except requests.exceptions.HTTPError as http_err:
        if http_err.response.status_code == 404:
            print(f"\nError: Could not find page for '{company_name}'.")
            print(f" Please check the company name spelling and format.")
            print(f" URL attempted: {url}")
            # Optionally save the 404 page content
            try:
                with open(html_filename, 'w', encoding='utf-8') as f:
                    f.write(http_err.response.text)
                # print(f"Saved 404 page content to {html_filename}")
            except IOError as io_e:
                print(f"Error saving 404 HTML to {html_filename}: {io_e}")
        else:
            print(f"\nHTTP Error occurred: {http_err}")
            # Save error response content
            try:
                with open(html_filename, 'w', encoding='utf-8') as f:
                    f.write(http_err.response.text)
                # print(f"Saved error response HTML content to {html_filename}")
            except IOError as io_e:
                print(f"Error saving error response HTML to {html_filename}: {io_e}")
        return None  # Indicate error by returning None
    except requests.exceptions.RequestException as req_err:
        print(f"\nError fetching URL {url}: {req_err}")
        return None  # Indicate error by returning None

    soup = BeautifulSoup(response.content, 'html.parser')

    # --- Extract CDN Info ---
    cdn_table = soup.find('table', id='cdn-list-networks-summary')
    if cdn_table:
        try:
            cdn_data_raw = []
            for row in cdn_table.find('tbody').find_all('tr'):
                cols = row.find_all('td')
                if len(cols) == 2:
                    service_link = cols[0].find('a')
                    service_name = service_link.text.strip() if service_link else cols[0].text.strip()
                    ip_count_str = cols[1].text.strip()
                    try:
                        ip_count = int(ip_count_str)
                        cdn_data_raw.append((service_name, ip_count))
                    except ValueError:
                        # Silently skip rows with unparseable IP counts for this section
                        pass
            # Sort data by IP count (descending) and store
            scraped_data['cdn'] = sorted(cdn_data_raw, key=lambda item: item[1], reverse=True)
        except AttributeError:
            print("Warning: Error parsing CDN table structure.")

    # --- Extract Hosting Info ---
    hosting_header = soup.find('h3', string=re.compile(r'Hosting', re.IGNORECASE))
    if hosting_header:
        hosting_table = soup.find('table', id='cloud-host-networks-summary')
        if hosting_table:
            try:
                hosting_data_raw = []
                for row in hosting_table.find('tbody').find_all('tr'):
                    cols = row.find_all('td')
                    if len(cols) == 2:
                        host_link = cols[0].find('a')
                        host_name = host_link.text.strip() if host_link else cols[0].text.strip()
                        ip_count_str = cols[1].text.strip()
                        try:
                            ip_count = int(ip_count_str)
                            hosting_data_raw.append((host_name, ip_count))
                        except ValueError:
                            pass  # Skip unparseable
                # Sort data by IP count (descending) and store
                scraped_data['hosting'] = sorted(hosting_data_raw, key=lambda item: item[1], reverse=True)
            except AttributeError:
                print("Warning: Error parsing Hosting table structure.")

    # --- Extract SaaS Info ---
    saas_table = soup.find('table', id='saas-list-networks-summary')
    if not saas_table:
        # Attempt alternate selector if primary ID fails
        saas_table = soup.select_one('table[id^="saas-list-networks-summary"]')
    if saas_table:
        try:
            saas_data_raw = []
            for row in saas_table.find('tbody').find_all('tr'):
                cols = row.find_all('td')
                if len(cols) == 2:
                    saas_link = cols[0].find('a')
                    saas_name = saas_link.text.strip() if saas_link else cols[0].text.strip()
                    ip_count_str = cols[1].text.strip()
                    try:
                        ip_count = int(ip_count_str)
                        saas_data_raw.append((saas_name, ip_count))
                    except ValueError:
                        pass  # Skip unparseable
            # Sort data by IP count (descending) and store
            scraped_data['saas'] = sorted(saas_data_raw, key=lambda item: item[1], reverse=True)
        except AttributeError:
            print("Warning: Error parsing SaaS table structure.")

    # Return the dictionary containing the lists of tuples
    # Return None if no sections were found at all (or only errors occurred)
    if not scraped_data['cdn'] and not scraped_data['hosting'] and not scraped_data['saas']:
        # Check if an HTTP/Request error already occurred (which returns None)
        # If soup exists, it means request was successful but parsing failed everywhere
        if 'soup' in locals():
            print(f"\nWarning: No CDN, Hosting, or SaaS data found for '{company_name}'. Check debug HTML.")
        return None  # Return None if truly no data found or request failed

    return scraped_data


def print_table(title, headers, data):
    """Helper function to print a formatted table."""
    print(f"--- {title} ---")
    if data:
        print(f"{headers[0]:<30} {headers[1]}")
        print("-"*45)
        for name, count in data:
            print(f"{name:<30} {count}")
    else:
        print("No data found for this section.")
    print("\n")


def main():
    parser = argparse.ArgumentParser(description='Netify Recon Tool')
    parser.add_argument('-c', '--company', required=True, help='Company name to lookup (e.g., walmart)')
    parser.add_argument('--analyze', action='store_true', help='Analyze the results with OpenAI (requires OPENAI_API_KEY env var)')
    args = parser.parse_args()

    print("\n")  # Initial padding

    scraped_data = scrape_netify(args.company)

    if scraped_data:
        # Print the tables
        print_table("CDNs", ["CDNs", "# of IPs"], scraped_data.get('cdn'))
        print_table("Hosting", ["Cloud Hosts", "# of IPs"], scraped_data.get('hosting'))
        print_table("SaaS", ["SaaS", "# of IPs"], scraped_data.get('saas'))

        # Perform analysis if requested
        if args.analyze:
            analysis_result = analyze_with_openai(args.company, scraped_data)
            if analysis_result:
                # Simple post-processing for potentially better spacing
                # Add an extra newline before numbered sections if not already preceded by two
                formatted_analysis = re.sub(r'\n(?=\n*[0-9]+\.\s)', '\n\n', analysis_result)
                # Ensure spacing after the title if needed
                formatted_analysis = formatted_analysis.replace('--- AI Analysis ---', '--- AI Analysis ---\n')
                print(formatted_analysis.strip())  # Print the processed result
                print("\n")  # Add spacing after analysis


if __name__ == "__main__":
    main()
```
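Since `scrape_netify()` returns a plain dictionary of `(name, ip_count)` tuples (or `None` on failure), the module can also be imported and reused programmatically. A minimal sketch, assuming it is run from the repo root next to `scopify.py`; the file name `reuse_example.py` is illustrative and not part of the repo.

```python
# reuse_example.py - hypothetical snippet, not part of the scopify repo
from scopify import scrape_netify

data = scrape_netify("walmart")
if data:
    # Each section is a list of (name, ip_count) tuples, already sorted by IP count descending.
    top_cdn = data["cdn"][0] if data["cdn"] else None
    total_saas_ips = sum(count for _, count in data["saas"])
    print("Top CDN:", top_cdn)
    print("Total SaaS IPs:", total_saas_ips)
else:
    print("No data returned; check the debug HTML file for details.")
```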