tl;dr There’s a hidden Amazon.com of illegal drugs: I scraped the entire “Cocaine” category, then made a bot that can intelligently price a kilo. Don’t do drugs.
DARK WEB DRUG MARKET ANALYSIS
Project Objective: Use machine learning regression models to predict a continuous numeric variable from any web-scraped data set.
: Price distributions on the hidden markets of the mysterious dark web
! Money, mystery, and machine learning.
: Turns out it is remarkably easy for anyone with internet access to visit dark web marketplaces and browse product listings. In this project I use Python to simulate the behavior of a human browsing these markets, selectively collect and save information from each market page this browsing agent views, and finally use the collected data in aggregate to construct a predictive pricing model.
(Optional Action Adventure Story Framing)
After bragging a little too loudly in a seedy Mexican cantina about your magic data science powers of prediction, you have been kidnapped by a forward-thinking drug cartel. They have developed a plan to sell their stock of cocaine on the internet. They demand that you help them develop a pricing model that will give them the most profit. If you do not, your life will be forfeit!
You, knowing nothing about cocaine or drug markets, immediately panic. Your life flashes before your eyes as the reality of your tragic end sets in. Eventually, the panic subsides and you remember that if you can just browse the market, you might be able to pick up on some patterns and save your life…
THE (HIDDEN) DOMAIN
Dark web marketplaces (cryptomarkets) are internet markets that facilitate anonymous buying and selling. Anonymity means that many of these markets trade illegal goods, as it is inherently difficult for law enforcement to intercept information or identify users.
While black markets have existed as long as regulated commerce itself, dark web markets were born somewhat recently when four technologies combined:
- Anonymous internet browsing (e.g. Tor and the Onion network)
- Virtual currencies (e.g. Bitcoin)
- Escrow (conditional money transfer)
- Vendor feedback systems (Amazon.com-like ratings of sellers)
Total cash flow through dark web markets is hard to estimate, but indicators suggest it is substantial and rising. The biggest vendors can earn millions of dollars per year.
In order to find a target variable suitable for linear regression, we’ll isolate our study to a single product type and try to learn its pricing scheme. For this analysis I choose to focus specifically on the cocaine sub-market. Cocaine listings consistently:
- report quantity in terms of the same metric scale (
- report quality in terms of numerical percentages (e.g.
These features give us anchors to evaluate each listing relative to others of its type, and to make comparisons relative to a standard unit: 1 gram, 100% pure.
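One way to make that comparison concrete is to divide a listing's price by the amount of pure product it contains. The sketch below is an assumption about how such a normalization could work, not necessarily the formula used in this project; the function name and the sample listing are made up for illustration.

```python
def price_per_pure_gram(price_usd, grams, purity_pct):
    """Price of one gram of 100%-pure product implied by a listing.

    Hypothetical helper: the normalization itself is an assumption.
    """
    if grams <= 0 or not 0 < purity_pct <= 100:
        raise ValueError("grams must be positive and purity in (0, 100]")
    pure_grams = grams * purity_pct / 100.0
    return price_usd / pure_grams


# Made-up listing: 10 grams at 80% purity for $720
print(price_per_pure_gram(720, 10, 80))  # 90.0
```

A quantity like this is handy for comparing listings, but it is derived from price, so it should not be fed back into a model that predicts price.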
Browsing Dream Market reveals a few things:
- There are about 5,000 product listings in the Cocaine category.
- Prices trend linearly with quantity, but some vendors sell their cocaine for less than others.
- Vendors ship from around the world, but most listings are from Europe, North America, and other English-speaking regions.
- Vendors are selective about which countries they are willing to ship to.
- Many vendors will ship to any address worldwide.
- Some vendors explicitly refuse to deliver to the US, Australia, and other countries that have strict drug laws or border control.
- Shipping costs are explicitly specified in the listing.
- Shipping costs seem to track typical international shipping rates for small packages and letters.
- Many vendors offer more expensive shipping options that offer more “stealth”, meaning more care is taken to disguise the package from detection, and it is sent via a tracked carrier to ensure it arrives at the intended destination.
- The main factor that determines price seems to be quantity, but there are some other less obvious factors too.
While the only raw numerical quantities attached to each listing are BTC Prices and Ratings, there are some important quantities represented as text in the product listing title:
- how many “grams” the offer is for
- what “percentage purity” the cocaine is
These seem like they will be the most important features for estimating how to price a standard unit of cocaine.
I decide to deploy some tools to capture all the data relating to these patterns we’ve noticed.
- BeautifulSoup automates the process of capturing information from HTML tags based on patterns I specify. For example, to collect the title strings of each cocaine listing, I use BeautifulSoup to search all the HTML of each search results page for tags that have class=productTitle, and save the text contents of any such tag found.
- Selenium WebDriver automates browsing behavior. In this case, its primary function is simply to go to the market listings and periodically click to the next page of search results, so that BeautifulSoup can then scrape the data. I set a timeout in the code so that the function would make HTTP requests at a reasonably slow rate.
- Pandas, to tabulate the data with Python, manipulate it, and stage it for analysis.
- Matplotlib and Seaborn, handy Python libraries for charting and visualizing data.
- scikit-learn, for regression models and other machine learning methods.
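As a rough illustration of the BeautifulSoup step, here is a minimal sketch that pulls the text out of every tag with class productTitle. The sample HTML, the surrounding div tags, and the function name are assumptions made up for this example; the real market pages will look different, and in practice the HTML would come from the automated browser (for instance Selenium's `driver.page_source`) rather than a hard-coded string.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet of a search results page; the real markup will differ.
SAMPLE_HTML = """
<div class="listing"><div class="productTitle">24 Grams 92% Pure Cocaine</div></div>
<div class="listing"><div class="productTitle">1g Sample</div></div>
<div class="sidebar">Not a listing</div>
"""


def extract_titles(html):
    """Return the text of every tag whose class is productTitle."""
    soup = BeautifulSoup(html, "html.parser")
    # Searching by class only, so the tag name itself does not matter.
    return [tag.get_text(strip=True) for tag in soup.find_all(class_="productTitle")]


titles = extract_titles(SAMPLE_HTML)
print(titles)
```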
[Image: Automated Browsing Behavior with Selenium WebDriver]
I build a dictionary of page objects, which includes:
- product listing
- listing title
- listing price
- vendor name
- vendor rating
- number of ratings
- ships to / from
The two most important numeric predictors, product quantity and quality (# of grams, % purity), are embedded in the title string. I use regular expressions to parse these string values from each title string (where present), and transform these values to numerical quantities. For example, “24 Grams 92% Pure Cocaine” yields the values grams = 24 and quality = 92 in the dataset.
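A minimal sketch of that parsing step might look like the following. The two patterns are my own guesses at what vendor titles look like (a number followed by "g", "gr", or "gram(s)", and a number followed by "%"), not the exact expressions used in the project, and they ignore titles that quote quantities in other units such as kilograms or ounces.

```python
import re

# Assumed patterns; real titles are messier and may need more cases.
GRAMS_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:grams?|gr|g)\b", re.IGNORECASE)
PURITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def parse_title(title):
    """Extract grams and percentage quality from a title, or None if absent."""
    grams = GRAMS_RE.search(title)
    purity = PURITY_RE.search(title)
    return {
        "grams": float(grams.group(1)) if grams else None,
        "quality": float(purity.group(1)) if purity else None,
    }


print(parse_title("24 Grams 92% Pure Cocaine"))
# {'grams': 24.0, 'quality': 92.0}
```

Returning None for a missing value keeps incomplete listings identifiable, so they can be dropped or imputed later instead of silently becoming zeros.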
Vendors use country code strings to specify where they ship from, and where they are willing to ship orders to.
For example, a vendor in Great Britain may list shipping as “GB – EU, US”, indicating they ship to destinations in the European Union or the United States.
In order to use this information as part of my feature set, I transform these strings into corresponding “dummy” boolean values. That is, for each data point I create new columns for each possible origin and destination country, containing values of either True or False to indicate whether the vendor has listed the country in the product listing. For example:
Ships to US: False
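Here is one way the transformation could be sketched in plain Python. The separator handling (an en dash or a hyphen between origin and destinations, commas between destinations) and the function names are assumptions based on the “GB – EU, US” example above, and the second sample string is made up.

```python
def parse_shipping(shipping):
    """Split a string like 'GB – EU, US' into (origin, [destinations])."""
    # Assumption: listings separate origin and destinations with an en dash or hyphen.
    normalized = shipping.replace("–", "-")
    origin, _, destinations = normalized.partition("-")
    dest_list = [d.strip() for d in destinations.split(",") if d.strip()]
    return origin.strip(), dest_list


def shipping_dummies(shipping_strings):
    """Build one dict of True/False columns per listing."""
    parsed = [parse_shipping(s) for s in shipping_strings]
    origins = sorted({o for o, _ in parsed})
    dests = sorted({d for _, ds in parsed for d in ds})
    rows = []
    for origin, ds in parsed:
        row = {f"Ships from {c}": (c == origin) for c in origins}
        row.update({f"Ships to {c}": (c in ds) for c in dests})
        rows.append(row)
    return rows


rows = shipping_dummies(["GB – EU, US", "NL – EU"])
print(rows[1])
# {'Ships from GB': False, 'Ships from NL': True, 'Ships to EU': True, 'Ships to US': False}
```

The resulting list of dictionaries can be passed straight to `pandas.DataFrame(rows)` to get one boolean column per country.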
After each page dictionary is built (i.e. one pass of the code over the website), the data collection function saves the data as a JSON file (e.g. page12.json). This is done so that information is not lost if the connection is interrupted during the collection process, which can take several minutes to hours. Whenever we want to work with collected data, we merge the JSON files together to form a Pandas data frame.
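The save-then-merge pattern can be sketched as below. This is an illustration under assumptions: each page is stored here as a list of listing records (the project's actual page dictionary may be shaped differently), and the function names and sample titles are made up.

```python
import json
import tempfile
from pathlib import Path


def save_page(records, page_number, out_dir):
    """Write one page of scraped listings to its own JSON file."""
    path = Path(out_dir) / f"page{page_number}.json"
    path.write_text(json.dumps(records), encoding="utf-8")
    return path


def load_all_pages(out_dir):
    """Merge every page*.json file in out_dir into one list of records."""
    merged = []
    for path in sorted(Path(out_dir).glob("page*.json")):
        merged.extend(json.loads(path.read_text(encoding="utf-8")))
    return merged


# Demo in a throwaway directory with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    save_page([{"title": "24 Grams 92% Pure Cocaine"}], 1, tmp)
    save_page([{"title": "1g Sample"}], 2, tmp)
    listings = load_all_pages(tmp)

print(len(listings))  # 2
```

Because each page lands in its own file, an interrupted run only loses the page in progress, and `pandas.DataFrame(listings)` turns the merged records into a data frame.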
The cleaned dataset yielded approximately 1,500 product listings for cocaine.
Here they are if you care to browse yourself!
Aside on Interesting Findings
There are a lot of interesting patterns in this data, but I’ll just point out a few relevant to our scenario:
- Of all countries represented, the highest proportion of listings have their shipping origin in the Netherlands (NL). This doesn’t imply they are also the highest in volume of sales, but they are likely correlated. Based on this data, I would guess that the Netherlands has a thriving cocaine industry. NL vendors also seem to price competitively.
- As of July 15th, 2017, the median price of cocaine is around $90 USD per gram:
- Prices go up substantially for anything shipped to or from Australia:
* charts generated from data using Seaborn
THE MACHINE LEARNING
In order to synthesize all of the numeric information we are now privy to, I next turn to scikit-learn and its machine learning models. In particular, I want to evaluate how well models from the linear regression family and the decision tree family fit my data.
Model Types Evaluated
- Linear Regression w/o regularization
- LASSO Regression (L1 regularization)
- Ridge Regression (L2 regularization)
- Random Forests
- Gradient Boosted Trees
To prepare the data, I separate my target variable (Y = Price) from my predictor features (X = everything else). I drop any variables in X that leak information about price (such as cost per unit). I’m left with the following set of predictor variables:
- Number of Grams
- Percentage Quality
- Rating out of 5.00
- Count of successful transactions for vendor on Dream Market
- Escrow offered? [0/1]
- Shipping Origin ([0/1] for each possible country in the dataset)
- Shipping Destination ([0/1] for each possible country in the dataset)
I split the data into random training and test sets (pandas dataframes) so I can evaluate performance using scikit-learn. Since a single random split may not be representative of groupings in the data that I haven’t explicitly stratified on, I average the scores over multiple evaluations.
Of the linear models, simple linear regression performed the best, with an average cross-validation R^2 “score” of around 0.89, meaning it explains about 89% of the variance in price.
Of the decision tree models, the Gradient Boosted trees approach resulted in the best prediction performance, yielding scores around 0.95. The best learning rate I observed was 0.05, and the other options were kept at the default settings for the scikit-learn library.
The model that resulted from the Gradient Boosted tree method picked up on a feature that revealed that 1-star ratings within the past month were characteristic of vendors selling at lower prices.
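For readers who want to try this kind of comparison, here is a minimal sketch using scikit-learn's cross-validation helper. It runs on synthetic stand-in data that I generate on the spot (not the scraped listings), uses only two features, and the 5-fold setting is an assumption; only the 0.05 learning rate is taken from the text above. The scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the scraped listings): price grows with
# grams, with a bulk discount, a purity effect, and some noise.
rng = np.random.default_rng(0)
grams = rng.uniform(0.5, 100, size=300)
quality = rng.uniform(60, 95, size=300)
price = 90 * grams ** 0.9 * (quality / 100) + rng.normal(0, 20, size=300)
X = np.column_stack([grams, quality])

models = {
    "linear": LinearRegression(),
    "gbt": GradientBoostingRegressor(learning_rate=0.05, random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation; the default score for regressors is R^2.
    scores[name] = cross_val_score(model, X, price, cv=5).mean()
    print(name, round(scores[name], 3))
```

Averaging over folds plays the same role as averaging over repeated random splits: no single lucky or unlucky split decides which model looks best.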
Prediction: Pricing a Kilogram
(Note: I employ forex_python to convert bitcoin prices to other currencies.)
I evaluate the prediction according to each of the two models described above, as well as a naive baseline:
- Naive approach: Take median price of 1 gram and multiply by 1000.
- Resulting price estimate: ~$90,000
- Review: Too expensive, no actual listings are anywhere near this high.
- Linear Regression Model: Fit a line to all samples and find the value at grams = 1000.
- Resulting price estimate: ~$40,000
- Review: Seems reasonable. But a model that accounts for more variance may give us a better price…
- Gradient Boosted Tree Model: Fit a sequence of small trees, each one correcting the errors of the trees before it.
- Resulting price estimate: ~$50,000 (Best estimate)
- Review: Closest to actual prices listed for a kilogram. Model accounts for most of the observed variance.
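To see why the naive baseline overshoots, here is a small sketch comparing it with a least-squares line. The listings are made-up numbers chosen to include a bulk discount, not the scraped data, so the outputs only illustrate the mechanics of the two estimates.

```python
from statistics import mean, median

# Made-up (grams, price in USD) pairs with a bulk discount; not real listings.
listings = [(1, 90), (1, 95), (1, 85), (5, 400), (10, 750),
            (28, 1900), (100, 6000), (500, 24000)]

# Naive baseline: median price of the 1-gram listings, scaled up linearly.
one_gram_prices = [p for g, p in listings if g == 1]
naive_kilo = median(one_gram_prices) * 1000

# Ordinary least squares for price = intercept + slope * grams.
xs = [g for g, _ in listings]
ys = [p for _, p in listings]
x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
linear_kilo = intercept + slope * 1000

print(f"naive:  ${naive_kilo:,.0f}")
print(f"linear: ${linear_kilo:,.0f}")
```

The fitted line is pulled toward the large listings, where the per-gram price is lower, so its kilogram estimate comes out well below the scaled-up single-gram price.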
THE BIG IDEAS
Darknet markets: large-scale, anonymous trade of goods, especially drugs. Accessible to anyone on the internet.
You can scrape information from dark net websites to get data about products.
Aggregating market listings can tell us about the relative value of goods offered, and how that value varies.
We can use machine learning to model the pricing intuitions of drug sellers.
(Optional Action Adventure Story Conclusion)
The drug cartel is impressed with your hacking skills, and they agree to adjust the pricing of their international trade according to your model. Not only do they let you live, but to your dismay, they promote you to lieutenant and place you in charge of accounting! You immediately begin formulating an escape in secret. Surely random forest models can help…