Our Wordlists Kinda Suck

Nov 5, 2023 · 788 words · 4 minute read

Maybe it’s because we dragged all our wordlists across from the days of Van Hauser’s Hydra, way back in 2000. But something happened around the time the OSCP certification began picking up steam. A wave of new tools, mostly written in Go or Rust, flooded the interwebs. Along with these tools came a fleet of wordlists: millions of words in text files, meant for the sole purpose of brute-forcing. I think the most popular set of wordlists is SecLists. Looking at the sheer number of wordlists for any occasion, it’s clear the proverb “a rolling stone gathers no moss” hasn’t applied here. We’ve come so far, and just look at all that moss we’ve collected! I would hope that we all use SecLists as a starting point and then slowly distil the wordlists down as we get to know our region or country better.

Nevertheless, I thought I would first look at the DNS wordlists in the repo. Find them under /Discovery/DNS. This directory has wordlists that you can use to brute-force subdomains. I wrote a Go module to do two things:

Lex the wordlists and check for characters that are invalid when resolving DNS hosts, and
Do a proper line count of each wordlist.
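
For illustration, here is a minimal sketch of what those two steps could look like. It is not the actual module: the function names (allowed, checkLine, scan) are hypothetical, and treating the dot as an allowed label separator is an assumption of this sketch. The error text is shaped to resemble the log lines in the results.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"os"
)

// allowed reports whether r may appear in a wordlist entry.
// Assumption: the dot is accepted as a label separator.
func allowed(r rune) bool {
	switch {
	case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
		return true
	}
	return r == '-' || r == '.'
}

// checkLine returns the 1-based column and value of the first disallowed
// character in line. An empty line is reported as '\n' at column 1.
func checkLine(line string) (col int, bad rune, ok bool) {
	if line == "" {
		return 1, '\n', false
	}
	for _, r := range line {
		col++
		if !allowed(r) {
			return col, r, false
		}
	}
	return 0, 0, true
}

// scan counts the lines in r and stops with an error at the first
// disallowed character it meets.
func scan(r io.Reader) (int, error) {
	sc := bufio.NewScanner(r)
	rows := 0
	for sc.Scan() {
		rows++
		if col, bad, ok := checkLine(sc.Text()); !ok {
			return rows, fmt.Errorf("invalid character %q found at row %d col %d", bad, rows, col)
		}
	}
	return rows, sc.Err()
}

func main() {
	// Each command-line argument is treated as a wordlist file.
	for _, name := range os.Args[1:] {
		f, err := os.Open(name)
		if err != nil {
			log.Printf("filename: %s error: %v", name, err)
			continue
		}
		n, err := scan(f)
		f.Close()
		if err != nil {
			log.Printf("filename: %s error: %v", name, err)
			continue
		}
		log.Printf("filename: %s linecount: %d", name, n)
	}
}
```

Stopping at the first bad character keeps the report short: one line per file is enough to tell whether a wordlist needs cleaning at all.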

I ran the library on the files inside the DNS directory and here are the results:

➜  sheran@leonov linecount go test -run=TestLexer
2023/11/05 21:51:00 filename: bitquark-subdomains-top100000.txt error: invalid character '*' found at row 37212 col 1
2023/11/05 21:51:00 filename: bug-bounty-program-subdomains-trickest-inventory.txt linecount: 1613291
2023/11/05 21:51:00 filename: combined_subdomains.txt error: invalid character '*' found at row 1 col 1
2023/11/05 21:51:00 filename: deepmagic.com-prefixes-top500.txt linecount: 500
2023/11/05 21:51:00 filename: deepmagic.com-prefixes-top50000.txt error: invalid character '_' found at row 4715 col 7
2023/11/05 21:51:00 filename: dns-Jhaddix.txt error: invalid character '@' found at row 4 col 1
2023/11/05 21:51:00 filename: fierce-hostlist.txt error: invalid character '_' found at row 770 col 4
2023/11/05 21:51:00 filename: italian-subdomains.txt linecount: 20000
2023/11/05 21:51:00 filename: n0kovo_subdomains.txt error: invalid character '\n' found at row 240002 col 1
2023/11/05 21:51:00 filename: namelist.txt error: invalid character '_' found at row 4979 col 8
2023/11/05 21:51:00 filename: remain.txt linecount: 1497687
2023/11/05 21:51:00 filename: shubs-stackoverflow.txt error: invalid character ',' found at row 807 col 24
2023/11/05 21:51:00 filename: shubs-subdomains.txt error: invalid character '_' found at row 32597 col 8
2023/11/05 21:51:00 filename: sortedcombined-knock-dnsrecon-fierce-reconng.txt error: invalid character '_' found at row 2242 col 1
2023/11/05 21:51:00 filename: subdomains-spanish.txt error: invalid character ' ' found at row 411 col 8
2023/11/05 21:51:00 filename: subdomains-top1million-110000.txt error: invalid character '_' found at row 689 col 4
2023/11/05 21:51:00 filename: subdomains-top1million-20000.txt error: invalid character '_' found at row 689 col 4
2023/11/05 21:51:00 filename: subdomains-top1million-5000.txt error: invalid character '_' found at row 689 col 4
2023/11/05 21:51:00 filename: tlds.txt error: invalid character '[' found at row 1411 col 11
2023/11/05 21:51:00 files processed: 20 errors: 15
PASS
ok      github.com/sheran/linecount     0.280s
➜  sheran@leonov linecount

15 of the 20 files in that directory had invalid characters. By that I mean, if you run those entries through gobuster’s DNS brute-forcer, those subdomains won’t resolve. That’s because RFC 1034 defines a preferred name syntax (section 3.5) with the following ruleset:


<subdomain> ::= <label> | <subdomain> "." <label>

<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]

<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>

<let-dig-hyp> ::= <let-dig> | "-"

<let-dig> ::= <letter> | <digit>

<letter> ::= any one of the 52 alphabetic characters A through Z in
upper case and a through z in lower case

<digit> ::= any one of the ten digits 0 through 9
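
The grammar above translates fairly directly into code. The following is a sketch only, with hypothetical function names (validLabel, validSubdomain), and it covers just the quoted rules. It does not enforce any length limits.

```go
package main

import "fmt"

func isLetter(c byte) bool { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') }
func isDigit(c byte) bool  { return c >= '0' && c <= '9' }

// validLabel follows <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]:
// start with a letter, end with a letter or digit, and allow letters,
// digits, and hyphens in between.
func validLabel(s string) bool {
	n := len(s)
	if n == 0 || !isLetter(s[0]) {
		return false
	}
	last := s[n-1]
	if !isLetter(last) && !isDigit(last) {
		return false
	}
	for i := 1; i < n-1; i++ {
		c := s[i]
		if !isLetter(c) && !isDigit(c) && c != '-' {
			return false
		}
	}
	return true
}

// validSubdomain follows <subdomain> ::= <label> | <subdomain> "." <label>:
// one or more valid labels joined by dots.
func validSubdomain(s string) bool {
	start := 0
	for i := 0; i <= len(s); i++ {
		if i == len(s) || s[i] == '.' {
			if !validLabel(s[start:i]) {
				return false
			}
			start = i + 1
		}
	}
	return true
}

func main() {
	for _, w := range []string{"www", "mail-01.corp", "foo_bar", "*", "-dev", "dev-"} {
		fmt.Printf("%-14q valid=%v\n", w, validSubdomain(w))
	}
}
```

Note that this strict reading also rejects labels that start with a digit, because the grammar requires a leading letter. A pure character check, like the one that produced the results above, is looser than that.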

So that’s it: a-z, A-Z, 0-9, and -. Those are the only characters allowed in a label when looking up a host. This means that all the words with invalid characters will not resolve and, worse, will make your brute-force task slower. So why are these weird characters even there?

A few reasons. For one, I know that there were programs that treated a character like "*" as a wildcard. The program would take a hostname like "starfish*" and expand it to "starfish1", "starfish2", … "starfish9". But modern-day tools like gobuster don’t do this. So essentially you’re ruining the efficiency of your already long-ass brute-forcing session.
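
To make the wildcard idea concrete, here is a rough sketch of that kind of expansion. The function name expandWildcard and the exact rule (replace the first "*" with the digits 1 through 9) are assumptions based on the starfish example, not the behaviour of any particular tool.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandWildcard replaces the first "*" in word with each digit 1 through 9.
// A word without a "*" comes back unchanged as a single-element slice.
func expandWildcard(word string) []string {
	if !strings.Contains(word, "*") {
		return []string{word}
	}
	out := make([]string, 0, 9)
	for d := 1; d <= 9; d++ {
		out = append(out, strings.Replace(word, "*", strconv.Itoa(d), 1))
	}
	return out
}

func main() {
	for _, w := range expandWildcard("starfish*") {
		fmt.Println(w)
	}
}
```

A tool without this step sends "starfish*" to the resolver as a literal name, which is a wasted query.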

What should you do about it? Well, in my opinion, it makes sense to clean the files so that they only contain subdomains that have a chance of successfully resolving. That takes care of one part of the mess that is DNS brute-forcing. The next part involves developing a repeatable process for taking all the successful hits from these files and building a smaller, more concentrated file of working subdomains. This can greatly speed up your discovery process.
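
As a starting point for the cleaning step, a small filter along these lines might do. It is a sketch under a few assumptions: the names entryRE and clean are hypothetical, the pattern checks only the character set (letters, digits, hyphens, and dots between labels) and not the full grammar, and entries are lowercased before de-duplication because DNS names are case-insensitive.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
)

// entryRE accepts one or more dot-separated labels made of lowercase
// letters, digits, and hyphens. It is a character-set check only.
var entryRE = regexp.MustCompile(`^[a-z0-9-]+(\.[a-z0-9-]+)*$`)

// clean trims and lowercases each line, drops entries that fail entryRE,
// and removes duplicates while keeping the original order.
func clean(lines []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, l := range lines {
		w := strings.ToLower(strings.TrimSpace(l))
		if !entryRE.MatchString(w) || seen[w] {
			continue
		}
		seen[w] = true
		out = append(out, w)
	}
	return out
}

func main() {
	// Read a wordlist on stdin and write the cleaned list to stdout.
	var lines []string
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, w := range clean(lines) {
		fmt.Println(w)
	}
}
```

Saved under a name of your choosing, say clean.go, it could be run as go run clean.go < namelist.txt > namelist.clean.txt. The same read-filter-write shape should work for the second part too: feed it only the names that actually resolved and you get the smaller, concentrated list.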