Wednesday, May 30, 2018

IP to country mapping in under 1MB (node js)

Hi all,
I recently wrote some node js code to do IP to country code mapping that uses less than 1MB of RAM for data storage. It also uses the ARIN registry data directly and so doesn't require attribution, unlike some of the commercial sources out there. Because it takes up little memory, it's feasible to keep the table in-memory on my tiny web server, which means I avoid any RPC costs associated with IP lookup.

The gist is to first pull a copy of the ARIN IP registry data:

wget ftp://ftp.arin.net/pub/stats/arin/delegated-arin-extended-latest
wget ftp://ftp.ripe.net/pub/stats/ripencc/delegated-ripencc-latest
wget ftp://ftp.lacnic.net/pub/stats/lacnic/delegated-lacnic-latest
wget ftp://ftp.apnic.net/pub/stats/apnic/delegated-apnic-latest
wget ftp://ftp.afrinic.net/pub/stats/afrinic/delegated-afrinic-latest

Then do a bit of filtering and sorting:

cat delegated-* | grep '|ipv4' | sort -t '|' -k 4,4 -k 5 -V > ./comb3
cat delegated-* | grep '|ipv6' | sort -t '|' -k 4,4 -k 5 -V > ./comb3.ipv6

Then build the memory structures and serialize them to disk. I handle IPv4 and IPv6 differently. Both use the raw Uint8Array type.

IPv4

The ARIN data specifies IP address ranges as an IP followed by a length, and the length is not necessarily a power of two. Since IP address are pretty densely allocated, the datastructure I use is a 65kB array mapping the first two numbers in an IP address to a run-length-coded list of (range + countryCode) blocks and gaps between ranges. Countrycode takes up one byte and the range or gap length is one byte (storing log of the length), plus a bunch of special cases for non-power-of-two lengths and so forth.

IPv6

ARIN specifies IPv6 ranges in CIDR notation, so it's a prefix followed by the number of significant bits in the prefix. Because of that and because the IPv6 range is very sparse, I use a simple trie with one level for each IP group (hextet, e.g. "2a0a"). Each level is a hash table with a 2-byte key (for the hextet) and a 3-byte value. The value is either one byte of CIDR routing prefix (e.g. the '20' in '/20') followed by two bytes of country code, or it is 3-bytes of offset indicating the location of the next level down in the trie. There's enough extra space in there to have a special bit to indicate an empty entry.

Because this data structure is read-only in production use, each hash table is resized to something like 1.3 times the number of entries.

Space usage and performacne

As of May 30, 2018, the ipv4 table is 634kB and the ipv6 table is 302kB.

With some dumb benchmarking, it looks like lookup is about 30us for either of ipv4 or ipv6. I will note that there are certain ipv4 blocks where the linked list gets long. I think I noted a max length of 512 entries for one, so user beware. One could probably add some skip-list-like indexes to the longer lists to limit the worst-case lookup time, or better yet, use the fact that most of the time ranges are power-of-two length so probably a list is not the best representation.

The node source code for all this is about 600 lines. I could probably open-source it if there's enough interest.

No comments:

Post a Comment