Nerdiness Quantified: What is a nerd?


Hello. I am a nerd. Depending on the context. Are you?

What is a nerd? What personality types consider themselves nerdy? What demographics characterize those who identify most strongly as nerdy?

This project uses quantitative methods to approach these qualitative questions.


Nerdiness Assessment

The Nerdy Personality Attributes Scale (NPAS) is a survey, freely available online, that aims to quantify nerdiness.

“Nerd” is a common social label in English, although there is no set list of criteria for it. The NPAS was developed by surveying a very large pool of personality attributes to see which ones correlated with self-reported nerd status, and combining those attributes into a scale. The NPAS estimates how similar a respondent’s personality is to the average of those who identify as nerds, versus those who do not.

Personality Testing offers an open and anonymized dataset of approximately 1500 responses to their NPAS survey, which includes the data described below.

Here is the entire NPAS. Feel free to score yourself!

Procedure: The NPAS has 26 questions. For each question, you rate how much you agree with a given statement on a five-point scale:

 1=Disagree <> 5=Agree

  1. I sometimes prefer fictional people to real ones.
  2. I prefer academic success to social success.
  3. My appearance is not as important as my intelligence.
  4. I gravitate towards introspection.
  5. I am interested in science.
  6. I care about super heroes.
  7. I like science fiction.
  8. I spend recreational time researching topics others might find dry or overly rigorous.
  9. I get excited about my ideas and research.
  10. I like to play RPGs. (e.g. D&D)
  11. I collect books.
  12. I am a strange person.
  13. I would rather read a book than go to a party.
  14. I love to read challenging material.
  15. I spend more time at the library than any other public place.
  16. I would describe my smarts as bookish.
  17. I like to read technology news reports.
  18. I am more comfortable interacting online than in person.
  19. I was in advanced classes.
  20. I watch science related shows.
  21. I was a very odd child.
  22. I am more comfortable with my hobbies than I am with other people.
  23. I have started writing a novel.
  24. I can be socially awkward at times.
  25. I enjoy learning more than I need to.
  26. I have played a lot of video games.
Generally speaking, the higher your total, the more you align with statements from people who call themselves nerdy.
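Scoring works by simply totaling your 26 responses. A minimal sketch in Python (the function name is my own invention):

```python
def npas_score(responses):
    """Total a list of 26 NPAS responses, each on the 1-5 agreement scale."""
    if len(responses) != 26:
        raise ValueError("the NPAS has exactly 26 questions")
    if not all(1 <= r <= 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")
    return sum(responses)

# Agreeing with every statement gives the maximum score of 130;
# disagreeing with every statement gives the minimum of 26.
```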
Check out the paper for more information on how this scale was developed.
Here’s a heatmap showing the correlation within these 26 questions. The darker the color, the more people answer those questions the same way. Everything has a positive correlation because these 26 questions were selected precisely because they reflect a similar underlying personality. However the degree to which any two questions correlate varies.
[Image: heatmap of correlations among the 26 NPAS questions]
It’s hard to parse this if you haven’t seen a heatmap before, so I’ll just point out that the darkest spot not on the diagonal line (and therefore the most correlated pair of questions) belongs to these two statements:

Q22. I am more comfortable with my hobbies than I am with other people.

Q26. I have played a lot of video games.

I’ll leave it to the reader to conjecture about what this says about gamer nerds and their social skills. 😉
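Under the hood, a heatmap like this is just a pairwise correlation matrix over the question columns. A minimal sketch with pandas, using a made-up handful of respondents and generic question labels:

```python
import pandas as pd

# Hypothetical answers (1-5 scale) to three questions from six respondents.
answers = pd.DataFrame({
    "Q1": [5, 4, 5, 2, 3, 5],
    "Q2": [5, 5, 4, 2, 3, 5],
    "Q3": [3, 2, 5, 1, 4, 4],
})

# Pairwise Pearson correlations between questions: this matrix is what
# the heatmap visualizes (darker cell = stronger correlation).
corr = answers.corr()

# Drawing the heatmap itself requires seaborn:
#   import seaborn as sns
#   sns.heatmap(corr, cmap="Blues")
```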

Data, cont.

Big Five Personality

In addition to the NPAS questions, this particular site also administers a ten-item personality test (TIPI) based on the Big Five model. This test yields a value for each of the following traits of the taker (the “big five traits”):
  1. Openness to Experience (O)
  2. Conscientiousness (C)
  3. Extraversion (E)
  4. Agreeableness (A)
  5. Neuroticism (N)

In addition, the survey asks for your personal association with the word nerdy, on a 1-7 scale. This is a very useful question for our quantitative approach! I’ve named the class of people who respond with a 6 or 7 nerd champions.
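Picking out the nerd champions is a one-line filter once the responses are in a dataframe. A toy sketch (the column name and values here are invented for illustration):

```python
import pandas as pd

# Hypothetical responses to the 1-7 "nerdy" self-rating question.
df = pd.DataFrame({"nerdy": [3, 6, 7, 5, 7, 2]})

# Nerd champions: respondents who answered 6 or 7.
champions = df[df["nerdy"] >= 6]
share = len(champions) / len(df)
```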


A demographic form during the test asked for various demographic variables:
  • Age
  • Gender
  • Race
  • Years of Education
  • Urbanness as a Child
  • English Native
  • Handedness
  • Religious Category
  • Sexual Orientation
  • Voted in a natl. election in past year
  • Married
  • Family size (you plus any siblings) growing up
  • Major in school
  • ASD: Have you ever been diagnosed with Autism Spectrum Disorder?
I thought that the ASD demographic question was particularly interesting, so I chose to focus on it as a target variable. It forms a binary class distinction (either you were diagnosed or you weren’t), so it can be used directly as labeled input to train classification algorithms.
A new question emerges…

How does autism contribute to the definition of nerdiness?

I looked into the representation in my data versus the general population and came up with some basic figures: 
Prevalence of ASD 

  • General Population  
    • ~1.5%: “Prevalence in the US is estimated at 1 in 68 births” (CDC, 2014)
  • My Sample: 
    • 5.5% of 1418 rows / responses (mostly US-based respondents) 

There is nearly 4 times the rate of ASD in this nerd-survey sample, relative to the entire population. I looked at the people in that 5.5% of my data, and charted a histogram of how they answered the “nerdiness-on-a-scale” question:
[Image: histogram of self-rated nerdiness levels, split by ASD diagnosis]
Here’s how to interpret this chart: the blue bars are stacks of all the people who didn’t report a diagnosis of ASD, whereas the green bars are stacks of all the people who did. So the green bars make up that 5.5% of the data, and the blue bars are everybody else.
By looking at the pattern of just the blue bars, we see that most people in this sample answered between “4” and “7”, with “6” as the most frequent response. Looking at the pattern of the green bars, we see that respondents with ASD are quite likely to answer “5” or higher, with “7” as the most frequent response. In other words, respondents with an ASD diagnosis tend to consider themselves strongly nerdy.
    I wanted to see if there were patterns in the way that people with ASD answered the survey questions. A classifier like logistic regression learns a coefficient weight for each variable in the data (e.g. for each question response), and plugs the weighted sum into a simple formula that outputs a number between 0 and 1. That continuous output can be thresholded into a No/Yes prediction, or treated as a probability and compared across respondents.
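    That “simple formula” is the logistic (sigmoid) function applied to the weighted sum of the inputs. A minimal sketch (the weights below are made up for illustration, not learned from this dataset):

```python
import math

def predict_probability(features, weights, bias):
    """Logistic regression's scoring step: squash a weighted sum into (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # the logistic (sigmoid) function

# A weighted sum of zero sits exactly at the 50/50 decision boundary.
p = predict_probability([0, 0], weights=[0.8, -0.3], bias=0.0)  # → 0.5
```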

    Framing the task for Machine Learning:

    A machine learning project can be described in three elements: the task, the experience, and the metric by which you will evaluate your algorithm’s performance.

    Task: Classify a survey respondent by whether they have been diagnosed with Autism or not (ASD).

    Experience: NPAS data: a corpus of survey responses where some respondents indicated a prior diagnosis of Autism.

    Performance Metric: Area Under ROC Curve (AUC)

    Data Cleaning and Transforms:

    In order to get a data set as clean as possible with minimal noise for good results, I:

    • Drop/exclude response rows that are missing data for any of: the NPAS questions, the TIPI personality inventory, or basic demographics. (Thankfully, most people answer all questions.)
    • Transform categorical variables into “dummy” yes/no binary variables. An algorithm can’t compare named categories, but it can tell the difference between 0 and 1 very well.
    • Calculate Big Five personality scores from the TIPI responses and keep only the resulting score variables (i.e. drop the individual questions, which become redundant and algorithm-confusing).
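    The dummy-variable step in particular is easy to show with pandas. A toy sketch (column names invented for illustration):

```python
import pandas as pd

# Hypothetical toy rows; the real dataset's column names may differ.
df = pd.DataFrame({
    "Q1": [4, 5, None],
    "gender": ["male", "female", "female"],
    "ASD": [0, 1, 0],
})

# 1. Drop rows with incomplete responses.
df = df.dropna()

# 2. One-hot encode categoricals: the algorithm can't compare category
#    names, but it handles 0/1 columns just fine.
df = pd.get_dummies(df, columns=["gender"])
```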

    Training and Testing Classification Models

    I used scikit-learn to evaluate each of the following model types:

    • Logistic Regression Classifier
    • Random Forest Classifier
    • Gradient Boosted Tree Classifier
    • K-Nearest Neighbors
    • Support Vector Classifier
    The model evaluation pipeline looks something like this:

    1. choose target (e.g. ASD yes/no)
    2. choose features (e.g. answers to NPAS survey)
    3. split the data randomly into separate chunks for training vs. testing each model
    4. cross-validate and calculate resulting ROC curves for each model
    5. generate an AUC score for each model
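    The pipeline above looks roughly like this in scikit-learn, with synthetic data standing in for the real survey responses (cross_val_score handles the splitting, cross-validation, and AUC scoring in one call):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 26 NPAS features and the ASD labels.
X, y = make_classification(n_samples=500, n_features=26, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "gbt": GradientBoostingClassifier(),
}

# Cross-validated AUC for each model (steps 3-5 above).
auc = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
       for name, m in models.items()}
```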
    Initially I used only the responses to the NPAS questions as input, meaning I excluded the demographic variables when predicting ASD. Here is a chart showing the ROC curve and AUC score for each model with typical hyperparameters (settings):
    [Image: ROC curves for ASD prediction, baseline models using NPAS responses only]
    If you haven’t learned how to read ROC curves, the main takeaway here is that the algorithms aren’t great to start: they predict only slightly better than 50/50 chance (the straight diagonal line, which corresponds to AUC = 0.5). The best algorithm is the curve with the most “Area Under the Curve” (AUC). In the results shown, that’s the Gradient Boosted Tree model.
    However, when I add in demographic data, the algorithms perform better, and simple Logistic Regression does very well. Here is an updated graph overlaid on the previous results showing the improvement.

    The improvement means that the machine learning classifier is picking up on the relationship to ASD in demographic questions, and by encoding that relationship as coefficient weights, it can more accurately predict whether a given respondent has ASD or not.

    I wanted to learn more about which features were helping to improve the response, so I used the very cool ML Insights package to take a look at what features the Gradient Boosted Tree was picking up on. The following chart shows the top 5 indicating variables ranked in order of effect: Q4, age, familysize, education level, and Q6. These gave the strongest discriminating signals for ASD.
    The inventory predictors (Q4 and Q6) are positively correlated with an ASD diagnosis, as is number of siblings. Years of education is negatively correlated.
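    ML Insights does fancier analysis, but scikit-learn’s built-in feature_importances_ attribute gives a similar quick ranking. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: 10 features, 3 of which carry signal.
X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by how much they reduce impurity across the trees,
# then keep the indices of the five strongest.
top5 = np.argsort(model.feature_importances_)[::-1][:5]
```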

    On the topic of age: because this demographic question asks about a diagnosis of ASD, not the actual presence of ASD traits, people who grew up under different psychiatric practices (i.e. when autism was diagnosed less often) will be underrepresented among the diagnosed cases in the data.

    Big Ideas

    Personality surveys are a trove of interesting data. They can tell us a lot about ourselves!

    With enough data, we can algorithmically discern components of “nerdiness” by looking closely at how the data varies (machine learning).

    We can approximate the ways in which personality sub-types (e.g. ASD) contribute to our collective conception of a “nerd”.

    Bonus: Code

    You can check out all of the code I used to create this project in this convenient jupyter notebook here on Github.

    Dark Market Regression: Calculating the Price Distribution of Cocaine from Market Listings

    tl;dr There’s a hidden market for illegal drugs on the dark web: I scraped the entire “Cocaine” category, then made a bot that can intelligently price a kilo. Don’t do drugs.


    Project Objective: Use machine learning regression models to predict a continuous numeric variable from any web-scraped data set.
    Selected Subject: Price distributions on the hidden markets of the mysterious dark web! Money, mystery, and machine learning.
    Description: Turns out it is remarkably easy for anyone with internet access to visit dark web marketplaces and browse product listings. In this project I use Python to simulate the behavior of a human browsing these markets, selectively collect and save information from each market page this browsing agent views, and finally use the collected data in aggregate to construct a predictive pricing model.

    (Optional Action Adventure Story Framing)

    After bragging a little too loudly in a seedy Mexican cantina about your magic data science powers of prediction, you have been kidnapped by a forward-thinking drug cartel. They have developed a plan to sell their stock of cocaine on the internet. They demand that you help them develop a pricing model that will give them the most profit. If you do not, your life will be forfeit!
    You, knowing nothing about cocaine or drug markets, immediately panic. Your life flashes before your eyes as the reality of your tragic end sets in. Eventually, the panic subsides and you remember that if you can just browse the market, you might be able to pick up on some patterns and save your life…


    Dark web marketplaces (cryptomarkets) are internet markets that facilitate anonymous buying and selling. Anonymity means that many of these markets trade illegal goods, as it is inherently difficult for law enforcement to intercept information or identify users.
    While black markets have existed as long as regulated commerce itself, dark web markets were born somewhat recently when 4 technologies combined:
    • Anonymous internet browsing (e.g. Tor and the Onion network)
    • Virtual currencies (e.g. Bitcoin)
    • Escrow (conditional money transfer)
    • Vendor feedback systems (ratings of sellers)
    Total cash flow through dark web markets is hard to estimate, but indicators show it as substantial and rising. The biggest vendors can earn millions of dollars per year.
    The market studied for this project is called Dream Market.
    In order to find a target variable suitable for linear regression, we’ll isolate our study to a single product type and try to learn its pricing scheme. For this analysis I choose to focus specifically on the cocaine sub-market. Cocaine listings consistently:
    • report quantity in terms of the same metric scale (grams), and
    • report quality in terms of numerical percentages (e.g. 90% pure).
    These features give us anchors to evaluate each listing relative to others of its type, and to make comparisons against a standard unit: 1 gram at 100% purity.
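    Normalizing to that standard unit is simple arithmetic, assuming price scales linearly with both grams and purity (a simplification; the function name is my own):

```python
def price_per_standard_unit(price_usd, grams, purity_pct):
    """Price of one gram at 100% purity, assuming price scales
    linearly with both quantity and purity (a simplification)."""
    if grams <= 0 or not 0 < purity_pct <= 100:
        raise ValueError("need positive grams and purity in (0, 100]")
    return price_usd / (grams * purity_pct / 100.0)

# e.g. $450 for 5 grams at 90% purity:
unit_price = price_per_standard_unit(450, 5, 90)  # → 100.0
```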

    Browsing Dream Market reveals a few things:
    • There are about 5,000 product listings in the Cocaine category.
    • Prices trend linearly with quantity, but some vendors sell their cocaine for less than others.
    • Vendors ship from around the world, but most listings are from Europe, North America, and other English speaking regions.
    • Vendors are selective about which countries they are willing to ship to:
      • Many vendors will ship to any address worldwide.
      • Some vendors explicitly refuse to deliver to the US, Australia, and other countries that have strict drug laws or border control.
    • Shipping costs are explicitly specified in the listing.
    • Shipping costs seem to track typical international shipping rates for small packages and letters.
    • Many vendors sell more expensive shipping options offering more “stealth”, meaning more care is taken to disguise the package from detection, and it is sent via a tracked carrier to ensure it arrives at the intended destination.
    • The main factor that determines price seems to be quantity, but there are some other less obvious factors too.
    While the only raw numerical quantities attached to each listing are BTC Prices and Ratings, there are some important quantities represented as text in the product listing title:
    • how many “grams” the offer is for
    • what “percentage purity” the cocaine is
    These seem like they will be the most important features for estimating how to price a standard unit of cocaine.
    I decide to deploy some tools to capture all the data relating to these patterns we’ve noticed.


    BeautifulSoup automates the process of capturing information from HTML tags based on patterns I specify. For example, to collect the title strings of each cocaine listing, I use BeautifulSoup to search all the HTML of each search-results page for tags with class=productTitle, and save the text contents of any such tag found.
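    In code, that search is nearly a one-liner. A sketch against a made-up snippet of results-page HTML (the real page structure is more involved):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a search-results page.
html = """
<div>
  <a class="productTitle">5 Grams 90% Pure</a>
  <a class="productTitle">1g Colombian 87%</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Grab the text of every tag carrying class="productTitle".
titles = [tag.get_text() for tag in soup.find_all(class_="productTitle")]
```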

    Selenium WebDriver automates browsing behavior. In this case, its primary function is simply to go to the market listings and periodically click to the next page of search results, so that BeautifulSoup can then scrape the data. I set a sleep timeout in the code so that the function would make http requests at a reasonably slow rate.
    Pandas tabulates the data in Python, manipulates it, and stages it for analysis.

    Matplotlib and Seaborn are handy Python libraries for charting and visualizing data.

    Scikit-learn provides the regression models and other machine learning methods.

    [Image: Automated Browsing Behavior with Selenium WebDriver]


    I build a dictionary of page objects, which includes:
    • product listing
    • listing title
    • listing price
    • vendor name
    • vendor rating
    • number of ratings
    • ships to / from
    • etc.
    The two most important numeric predictors, product quantity and quality (# of grams, % purity), are embedded in the title string. I use regular expressions to parse these values from each title string (where present) and transform them into numerical quantities. For example, “24 Grams 92% Pure Cocaine” yields the values grams = 24 and quality = 92 in the dataset.
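    A sketch of that parsing step with Python’s re module (the exact patterns I used may differ; these handle the common title formats):

```python
import re

def parse_title(title):
    """Pull grams and % purity out of a listing title, where present."""
    grams = re.search(r"(\d+(?:\.\d+)?)\s*g(?:rams?)?\b", title, re.I)
    purity = re.search(r"(\d+(?:\.\d+)?)\s*%", title)
    return (float(grams.group(1)) if grams else None,
            float(purity.group(1)) if purity else None)

parse_title("24 Grams 92% Pure Cocaine")  # → (24.0, 92.0)
```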
    Vendors use country code strings to specify where they ship from, and where they are willing to ship orders to.
    For example, a vendor in Great Britain may list shipping as “GB – EU, US”, indicating they ship to destinations in the European Union or the United States.
    In order to use this information as part of my feature set, I transform these strings into corresponding “dummy” boolean values. That is, for each data point I create new columns for each possible origin and destination country, containing values of either True or False to indicate whether the vendor has listed the country in the product listing. For example: Ships to US: False
    After each page dictionary is built (i.e. one pass of the code over the website), the data collection function saves the data as a JSON file (e.g. page12.json). This is done so that information is not lost if the connection is interrupted during the collection process, which can take several minutes to hours. Whenever we want to work with collected data, we merge the JSON files together to form a Pandas data frame.
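    Merging the saved pages back together is straightforward, assuming each pageN.json holds a list of row dictionaries (as mine did):

```python
import json
import pathlib

import pandas as pd

def merge_pages(directory):
    """Combine per-page JSON dumps (page1.json, page2.json, ...) into one frame."""
    rows = []
    for path in sorted(pathlib.Path(directory).glob("page*.json")):
        with open(path) as f:
            rows.extend(json.load(f))  # each file: a list of row dicts
    return pd.DataFrame(rows)
```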


    The cleaned dataset yielded approximately 1,500 product listings for cocaine.
    Here they are if you care to browse yourself!

    Aside on Interesting Findings

    There are a lot of interesting patterns in this data, but I’ll just point out a few relevant to our scenario:
    • Of all countries represented, the highest proportion of listings have their shipping origin in the Netherlands (NL). This doesn’t imply they are also the highest in volume of sales, but they are likely correlated. Based on this data, I would guess that the Netherlands has a thriving cocaine industry. NL vendors also seem to price competitively.
    • As of July 15th, 2017, cocaine costs around $90 USD per gram (median price per gram):

    • Prices go up substantially for anything shipped to or from Australia:
    * charts generated from data using Seaborn


    In order to synthesize all of the numeric information we are now privy to, I next turn to scikit-learn and its libraries for machine learning models. In particular, I want to evaluate how well models in the linear regression family and decision tree family of models fit my data.

    Model Types Evaluated

           Linear Models
    • Linear Regression w/o regularization
    • LASSO Regression (L1 regularization)
    • Ridge Regression (L2 regularization)
      Decision Tree Models
    • Random Forests
    • Gradient Boosted Trees
    To prepare the data, I separate my target variable (Y = Price) from my predictor features (X = everything else). I drop any variables in X that leak information about price (such as cost per unit). I’m left with the following set of predictor variables:

    X (Predictors)

    • Number of Grams
    • Percentage Quality
    • Rating out of 5.00
    • Count of successful transactions for vendor on Dream Market
    • Escrow offered? [0/1]
    • Shipping Origin ([0/1] for each possible country in the dataset)
    • Shipping Destination ([0/1] for each possible country in the dataset)

    Y (Target)

    • Listed Price
    I split the data into random training and test sets (pandas dataframes) so I can evaluate performance using scikit-learn. Since any single random split may be unrepresentative, I take an average of scores over multiple evaluations.
    Of the linear models, simple linear regression performed the best, with an average cross-validation R^2 “score” of around 0.89, meaning it accounts for about 89% of the actual variance.
    Of the decision tree models, the Gradient Boosted Trees approach gave the best prediction performance, yielding scores around 0.95. The best learning rate I observed was 0.05, with the other options kept at scikit-learn’s defaults.
    The Gradient Boosted Tree model also picked up on a feature revealing that 1-star ratings within the past month were characteristic of vendors selling at lower prices.
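    The comparison above can be sketched with scikit-learn’s cross-validation utilities, with synthetic data standing in for the cleaned listings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the listings (features vs. price).
X, y = make_regression(n_samples=300, n_features=8, n_informative=5,
                       noise=10, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(),
    "ridge": Ridge(),
    "gbt": GradientBoostingRegressor(learning_rate=0.05, random_state=0),
}

# Average R^2 across cross-validation folds, as described above.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
```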

    Prediction: Pricing a Kilogram

    (Note: I employ forex_python to convert bitcoin prices to other currencies.)

    I evaluate the prediction according to each of the two models described above, as well as a naive baseline:

    1. Naive approach: Take median price of 1 gram and multiply by 1000.
      • Resulting price estimate: ~$90,000
      • Review: Too expensive, no actual listings are anywhere near this high.
    2. Linear Regression Model: Fit a line to all samples and find the value at grams = 1000.
      • Resulting price estimate: ~$40,000
      • Review: Seems reasonable. But a model that accounts for more variance may give us a better price…
    3. Gradient Boosted Tree Model: Fit an ensemble of trees, each new tree correcting the errors of the previous ones.
      • Resulting price estimate: ~$50,000 (Best estimate)
      • Review: Closest to actual prices listed for a kilogram. Model accounts for most of the observed variance.
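    The linear approach can be sketched in a few lines; the listing numbers below are made up for illustration, but they show the same bulk-discount effect (the fitted line lands well under the naive 1000 × $90 extrapolation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical listings: grams offered vs. price in USD.
grams = np.array([[1], [5], [10], [50], [100], [500]])
price = np.array([90, 400, 750, 3200, 5800, 24000])

model = LinearRegression().fit(grams, price)

# Extrapolate the fitted line out to one kilogram.
kilo_estimate = float(model.predict([[1000]])[0])
```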


    Darknet markets: large-scale, anonymous trade of goods, especially drugs. Accessible to anyone on the internet.

    You can scrape information from dark net websites to get data about products.

    Aggregating market listings can tell us about the relative value of goods offered, and how that value varies.

    We can use machine learning to model the pricing intuitions of drug sellers.

    (Optional Action Adventure Story Conclusion)

    The drug cartel is impressed with your hacking skills, and they agree to adjust the pricing of their international trade according to your model. Not only do they let you live, but to your dismay, they promote you to lieutenant and place you in charge of accounting! You immediately begin formulating an escape in secret. Surely random forest models can help…

    Review: The Information: A History, A Theory, A Flood

    The Information: A History, a Theory, a Flood by James Gleick

    A thorough exploration of information theory: how communication functions at its most fundamental level. Language, mathematics, cryptography, memory, computing, the history of telecommunication, and the history of intuitive human information theory before and after it was formalized.

    Most intriguing is the third of Gleick’s informational themes: the Flood, our modern immersion in quantifiable information. The book ends with allusions to Borges’ “The Library of Babel”, a short story that seems ever more apt and readily appreciable as time goes on.

    Freeman John Dyson has written a substantive synopsis and review entitled “How We Know”, here:…

    View other book reviews

    Review: Physics of the Future: How Science Will Change Daily Life by 2100

    Physics of the Future: How Science Will Shape Human Destiny and Our Daily Lives by the Year 2100 by Michio Kaku
    I’m not sure about anybody’s ability to predict a century into the future (especially if you give credence to the idea of accelerating returns in technology), but I was willing to give this book a shot after hearing Michio Kaku in interviews. In particular he piqued my curiosity with the claim that all the ideas in the book are grounded on currently existing prototypes or established scientific theory.

    Now, after having read it, I think Michio gives only a survey of select topics, and the only ones I think he handled well were those most closely linked to physics (e.g. space travel, nanotechnology and quantum behavior, global power generation). The other fields he dives into, particularly his analysis and extrapolation of consumer technologies, were disappointingly off target or lacking in proper depth.

    The book is occasionally so superficial in its treatment of some prototyped technologies that it reads somewhat like painfully outdated sci-fi from Michio’s childhood in the 50’s. The book is written to be highly accessible, but he does uninformed readers a disservice by giving equal weight to illogical and ‘improbable but not impossible’ possibilities.

    My biggest problem with his predictions is that they center on only a set of technologies that Kaku has experience with, extrapolated all the way out to 2100 without much consideration of how all the unmentioned possibilities will change his visions for the future.

    As an example, Michio doesn’t do the best job keeping our present circumstances and his far future predictions from mixing anachronistically: e.g. the frequently repeated “…when we carry around our own genomes on a CD-ROM” for a “2030-2070” range prediction. I worry that Michio Kaku is just paraphrasing some of the ideas out there without really thinking about them any more critically, like a mediocre science journalist or sci-fi writer. Again this could be an artifact of his intentionally writing this book to be broadly accessible, but I don’t think he found the right balance.

    The best parts of the book are in the second half, particularly his chapters on The Future of: Energy, Space Travel, Wealth, and Humanity (respectively) and I did enjoy most of this material despite a scattering of the same problems mentioned above. Sadly, I think Michio Kaku completely botched his concluding chapter, “A Day in the Life in 2100”, and I think the preceding Future of Humanity chapter would have been a much stronger ending. The “Day in the Life” conclusion is silly speculative fiction and the best (worst?) example in the entire book of his anachronistic and muddled sci-fi visions.

    Michio Kaku is great when talking about physics and large scale trends closely linked to humanity’s knowledge of physics, but judging from this book alone he doesn’t put together upcoming technologies into realistic or compelling future scenarios very well at all, ending up with an incomplete picture somewhere in the uncomfortable border between imaginative thinking and unwarranted speculation.

    Other reviews

    The Adumbration

    The Adumbration lumbered in the distance, outline clouded by horizon’s haze.

    “What is it?” the countryside cried, “What will it mean?” the elders worried.

    Its dusty gray obscuring more and more into the clear blue sky, the Adumbration loomed.

    “What should we feel?” the townspeople asked, “Every man think for himself!” the scattering elite replied.

    Indiscernible, nearly there but not yet here, almost not yet a thing unto itself, the Adumbration rust the land.

    “We must know if we are to go on!” the people plead, “We can’t really tell.” duly unspoken.

    Coming still, roiling yet becalmed, ephemeral but always, the Adumbration was unseemly.

    “We will distract ourselves with seeking” some said, “We will distract ourselves with providing.” said some.

    Almost invisible the Adumbration stayed.

    “Now we are certain!” proclaimed the everyman. “Now we are content.” thought his mind.

    The Adumbration continued.

    Questioning the Answers


    Why would computers deprive us of insight? It’s not like it means anything to them…

    Surreal story time! The setting: Cornell University. Fellow scientists Hod Lipson and Steve Strogatz find themselves thinking about our scientific future very differently in the final story of WNYC Radiolab’s recent Limits episode. In the relatively short concluding segment, “Limits of Science”, Dr. Strogatz voices concern about the implications of automated science as we learn about Dr. Lipson’s jaw-dropping robotic scientist project, Eureqa.

    I can relate with Steve Strogatz’s concern about our seemingly imminent scientific uselessness. But is there actually anything imminent here? Science is the language we use to describe the universe for ourselves. Scientific meaning originates with us, the humans who cooperate to create the modal language of science. What are human language or ‘meaning’ to the Eureqa bot but extra steps to repackage the formula into a less precise, linguistically bound representation? If one considers mathematics to be the most concise scientific description of phenomena, hasn’t the robot already had the purest insight?

    Given the sentiments expressed by Dr. Strogatz and Radiolab’s hosts Jad and Robert, it’s easy to draw comparisons between Eureqa and Deep Thought (the computer that famously answered “42” in The Hitchhiker’s Guide to the Galaxy). Author Douglas Adams was a brilliant satirist as much as a prescient predictor of our eventual technological capacity (insofar as Deep Thought is like Eureqa). The unfathomably simplistic answer of “42”, and the resulting quandary that faced the receivers of the Answer to Life, the Universe, and Everything in HHGTTG, is partially intended to make us aware that we are limited in our abilities of comprehension.

    More importantly, it shows that meaning is not inherent in an answer. 42 is the answer to uncountable questions (e.g. “What is six times seven?”) and Douglas Adams perhaps chose it bearing this fact in mind. Consider that if the answer Deep Thought gave was a calculus equation 50,000 pages long, the full insight of his satire might be lost on us; it’s easy to assume an answer so complicated is likewise accordingly meaningful, when in fact the complex answer is no more inherently accurate or useful in application than the answer of 42.

    Deep Thought

    The Eureqa software doesn’t think about how human understanding is affected by the discovery of the formulas that best describe correlations in the data set. When Newton observed natural phenomena and eventually discovered his now-eponymous “F = ma” law, he reached the same conclusion as the robot; the difference is that Newton was a human-concerned machine as well as a physical observer. He ascribed broader meaning to the formula by associating the observed correlation with systems that are important for human minds, the scientific language of physics, and consequently engineering and technology. A robotic scientist doesn’t interface with these other complex language systems, and therefore does not consider the potential applications of its discoveries (for the moment, at least).

    Eureqa doesn’t experience “Eureka!” insight because it isn’t like Archimedes, a man so thrilled by his bathtub discovery of water displacement that legend remembers him running naked through the streets of Syracuse. Archimedes realized that his discovery could be of incalculable importance to human understanding. It is from this kind of associative realization that the overwhelming sense of profound insight emerges. When Eureqa reaches a conclusion about the phenomena it is observing, it displays the final formula and quietly rests, having already discovered everything that is inherently meaningful. It does not think to ask why the conclusion matters, nor can it tell as much to its human partners.

    “Why?” is a tough question; the right answer depends on context. Physicist Richard Feynman, in his 1983 BBC interview series “Fun to Imagine”, takes time for an aside during a question on magnetism. When asked “Why do magnets repel each other?”, Feynman stops to remind the interviewer and the audience of a critical distinction in scientific and philosophical thinking: “why” is always relative.

    “I really can’t do a good job, any job, of explaining magnetic force in terms of something else that you’re more familiar with, because I don’t understand it in terms of anything else that you’re more familiar with.” – Dr. Feynman

    Meaning is not inherent or discoverable; meaning is learned.

    Making Virtual Sense of the Physical World

    You’ll remember everything. Not just the kind of memory you’re used to; you’ll remember life in a sense you never thought possible.

    Wearable technology is already accessible and available to augment anyone’s memory. By recording sensory data we would otherwise forget, digital devices enhance memory somewhat like the neurological condition synesthesia does: automatic, passive gathering of contextual ‘sense data’ about our everyday life experiences. During recollection, having the extra contextual information stimulates significantly more brain activity, and accordingly yields significant improvements in accuracy.

    This week, Britain’s BBC2 Eyewitness showed off research by Martin Conway [Leeds University]: MRI brain scans of patients using the “SenseCam” from Girton Labs (Cambridge, UK), a passive accessory that takes pictures when triggered by changes in the environment, capturing momentary memory aids.

    The BBC2 Eyewitness TV segment on the SenseCam as a memory aid:

    The scientists’ interpretation of the brain imaging studies seems to indicate that the vividness and clarity of recollection are significantly enhanced for device users, even with only the fragmentary visual snapshots from the SenseCam. One can easily imagine how a device that could also record smells, sounds, humidity, temperature, bio-statistics, and so on might drastically alter the way we remember everyday life!
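    The trigger-on-change idea behind a SenseCam-style device can be sketched in a few lines. Everything here is a hypothetical illustration (the sensor, the values, and the threshold are invented), not a description of the actual hardware:

```python
# A minimal sketch of change-triggered capture: record a snapshot whenever
# the sensed environment jumps by more than a threshold since the last one.

def capture_on_change(readings, threshold=2.0):
    """Return the indices at which a snapshot would be taken."""
    snapshots = []
    last = None
    for i, value in enumerate(readings):
        if last is None or abs(value - last) > threshold:
            snapshots.append(i)   # stand-in for triggering the camera
            last = value
    return snapshots

# Hypothetical ambient light levels over time: steady, a jump (walking
# outside), steady again, then a drop (coming back indoors).
light = [5.0, 5.1, 5.2, 9.0, 9.1, 3.0, 3.1]
print(capture_on_change(light))  # [0, 3, 5]
```

The device stays quiet through the uneventful stretches and captures exactly the moments of change, which is what makes the snapshots useful as memory cues.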

    Given this seemingly inevitable technological destiny, we may feel the limits of human memory changing dramatically in the near future. Data scientists are uniquely positioned to see this coming; a recent book by former Microsoft researchers Gordon Bell and Jim Gemmell, Total Recall: How the E-Memory Revolution Will Change Everything, begins its hook with “What if you could remember everything? Soon, if you choose, you will be able to conveniently and affordably record your whole life in minute detail.”

    When improvements in digital interfacing allow us to use the feedback from our data-collecting devices effortlessly and in real-time, we might even develop new senses.

    A hypothetical example: my SkipSenser device can passively detect infrared radiation from my environment and relay this information, immediately and unobtrusively, to my brain (perhaps first imagine a visual gauge in a retinal display). By simply going through my day to day life and experiencing the fluctuations in the infrared radiation of my familiar environments, I will naturally begin to develop a sense for the infrared radiation being picked up by the device. In this hypothetical I might develop over time an acute sense of “heat awareness”, fostered by the unceasing and incredibly precise measurements of the SkipSenser.
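    The SkipSenser’s relay step might look something like this toy sketch: normalize a raw infrared reading into a bounded feedback intensity that the wearer could gradually learn to read as a “heat sense”. The function name and the 0–100 raw range are invented for illustration, in keeping with the hypothetical device:

```python
# Hypothetical SkipSenser feedback: map a raw infrared reading onto a
# 0.0-1.0 intensity, clamped at both ends of an assumed sensor range.

def to_feedback(reading, lo=0.0, hi=100.0):
    """Clamp and normalize a raw sensor reading into feedback intensity."""
    return max(0.0, min(1.0, (reading - lo) / (hi - lo)))

print(to_feedback(25.0))   # 0.25
print(to_feedback(150.0))  # off-scale readings clamp to 1.0
```

The interesting part isn’t the arithmetic, of course; it’s that a brain fed this signal constantly, for years, might come to experience it as directly as warmth on skin.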

    Of course I’m not limited to infrared radiation for my SkipSenser; hypothetically, anything detectable can stimulate a new sense. The digital device acts as an aid or a proxy for the body’s limited analog sense detectors (eyes, ears, skin, i.e. our evolutionary legacy hardware) and also adds new detectors, allowing the plastic brain to adapt itself to new sensory input. I could specialize my auditory cortex, subtly sensing the characteristics of sound waves as they pass through the air, discovering patterns and insights previously thought too complex for normal human awareness. I could even allow all of my native senses to slowly atrophy in favor of cultivating a set of entirely unfamiliar ones, literally changing my perception to fit some future paradigm.

    NASA Interferometer Images

    Augmenting our sensory systems isn’t new, it’s what humans are naturally selected for. Generally speaking, ‘tool’ or ‘technology’ implies augmentation. If you drive a car, your brain has to adapt to the feel of the steering wheel, the pressure needed to push the pedals, the spatial dimensions of the vehicle, the gauges in the dashboard. While you learned how to drive a car (or ride a bike), your brain was building a neural network by associatively structuring neurons, working hard to find a system good enough to both A) accurately handle these new arbitrary input parameters and B) process the information at a rate that allows you to respond in a timely fashion (i.e. drive without crashing). That ability to restructure based on sensory feedback is the essence of neuroplasticity; it’s how humans specialize, how humanity shows such diverse talent as a species.

    That diversity of talent seems set to explode because here’s what is new: digital sensors that are easy to use, increasingly accessible, and surpassing human faculty. Integrated devices like the SenseCam continue to add functionality and shrink in size and effort required, now encompassing a sensory cost-benefit solution that appeals not only to the disabled, but to the everyman.

    There may be no limits to the range of possible perception. Depending on your metaphysical standpoint, this might also mean there may be no limits to the range of possible realities.

    Why it feels like Easter time

    Two quick Easterly follow-ups to the thought a few days ago on April Fools’ Day as a holiday in celebration of the vernal equinox (i.e. spring).

    • The vernal equinox, I’ve since learned, can be considered either the ‘first day of spring’ or the ‘middle of spring’ in the northern hemisphere, depending on your perspective (going by the change in ground temperature versus the astronomical equinox, when sunlight falls at a precise midpoint on the earth’s surface, respectively).
    • It’s Easter! Why is it “Easter”? Easter is a critically important religious holiday for Christian faiths. So why not call it Resurrection Day (a few do), or the Festival of the Ascendance, or Jesus April Fools Day? According to the Oxford English Dictionary: “Etymologically, ‘Easter’ is derived from Old English. Germanic in origin, it is related to the German Ostern and the English east. [Bede] describes the word as being derived from Eastre, the name of a goddess associated with spring.” So, at least in name if not spirit, Easter has strong ties to the season of spring.

    Ok, one more:

    • Easter Bunnies and Easter eggs came into the picture about a millennium and a half after the holiday got its roots, around the 1600s in early modern Germany (the Holy Roman Empire). Originally, the German tradition of bringing eggs was not linked to Easter, nor were the eggs edible. America especially liked the tradition and adopted it from German immigrants (much as it did the idea of Kris Kringle), and in the modern era the Easter bunny and colorful eggs are the ubiquitous symbols of a secularized Easter. This linking of imagery was not threatening to the Christian churches, because bunnies and eggs are ancient symbols of fertility. From Wikipedia: “Eggs, like rabbits and hares, are fertility symbols of extreme antiquity. Since birds lay eggs and rabbits and hares give birth to large litters in the early spring, these became symbols of the rising fertility of the earth at the Vernal Equinox.”

    I’ll close with an intriguingly opposed perspective (so to speak) from an Australian social researcher, Hugh Mackay, on Easter:

    “A strangely reflective, even melancholy day. Is that because, unlike our cousins in the northern hemisphere, Easter is not associated with the energy and vitality of spring but with the more subdued spirit of autumn?” – Hugh Mackay


    The verb “unfriend” is the New Oxford American Dictionary’s Word of the Year for 2009. As in, “I decided to unfriend my roommate on Facebook after we had a fight.”

    The New Oxford lexicographers offer an interesting, albeit brief, quip about how the word relates to modern technology and how its “un-” prefix is un-usual. Why is “unfriend” such a well-known trend this year? The concept of severing a form of communication with another person isn’t new to the Internet, let alone to social networking sites, which already host millions of people “friending” and “unfriending” each other. A better question, then: why is “unfriending” now more frequent, or more public?

    Perhaps unfriending is merely pruning the branches of a tree that cannot support every leaf. Abscission cells cut once-invaluable leaves and let them fall away so that the tree itself may continue to exist. A network with too much irrelevant input yields noise that drowns out the data from the smaller networks that we can afford to sustainably comprehend, threatening the communicative value of the whole.

    The idea of unfriending seems appropriate and reasonable given the above assumption. Why, then, is “unfriending” the word of this particular year? Perhaps because enough people have now reached a point of social-network saturation and feel the need to publicly acknowledge as much.

    When we announce that we have completed the process of “unfriendship” to the unpruned branches of our network, the remaining individuals are inclined to feel the pride of group-inclusion, even if they paid little attention to the communication of that network to begin with.

    We boom with the idea of Internet-based social networking, extending to touch as many others as we can reach given the ease in establishing a connection. Now we’re self-correcting, acknowledging the effort required to handle the sheer amount of potential content available to us. In order to actively engage, we must actively disengage. Focus makes it easier to contribute effectively.

    Twitter’s recently announced “Lists” feature is a euphemistic twin of the “unfriend” trend. Rather than applying exclusion to a master list of friends a la Facebook, Twitter is encouraging inclusion in tailored lists to funnel the ever-rising torrent of content. Both apply a similar mechanism to allow us to gain some noise control over our cultural vantage points.
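    The two noise-reduction mechanisms can be sketched side by side; the feed and the names below are hypothetical, chosen only to show that exclusion and inclusion are mirror images of the same filter:

```python
# A toy feed of (author, post) pairs.
feed = [("alice", "post 1"), ("bob", "post 2"), ("carol", "post 3")]

# Exclusion, Facebook-style: start with everyone, remove the unfriended.
unfriended = {"bob"}
by_exclusion = [p for p in feed if p[0] not in unfriended]

# Inclusion, Twitter-Lists-style: start with no one, admit a curated list.
close_friends = {"alice"}
by_inclusion = [p for p in feed if p[0] in close_friends]

print(by_exclusion)  # alice and carol remain
print(by_inclusion)  # only alice
```

Same mechanism, opposite defaults: exclusion trims an ever-growing master list, while inclusion curates a small one, which is perhaps why one earns a euphemism and the other a Word of the Year.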

    Now given that there are inclusive and exclusive methods for social noise reduction, why is the exclusive term Oxford’s Word of the Year? I suspect the eagerness to advertise the abscissive act of unfriending, a once-hushed topic now more widely socialized, has led to relaxed inhibitions regarding the use of the negative term and a desire to express and mentally justify what would otherwise be a hidden work of effort.

    The Beginning of an Idea

    How can two identical communications be perceived very differently by the same person?

    Take the phrase “I love you!” and send it in two separate emails. The data is encoded and represented identically in both communications. A person inspecting the raw data won’t be able to differentiate between the two emails.

    Now consider the phrase “I love you!” sent to you in two separate emails: one from your significant other, and one from an unlikable business acquaintance. Both emails have only the phrase “I love you!” and no other content, but your emotional response changes dramatically once a key piece of information is known: the originator. Your initial emotional response, contextual interpretation, and even the voice you use to read the message in your head change once you can imagine who sent the message. Knowledge of origination is a key component of interpretation.
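    The point can be made literal with Python’s standard email library (the addresses are hypothetical): the two messages carry byte-identical payloads, and everything that changes your reading of them lives in a single header.

```python
from email.message import EmailMessage

def make_email(sender):
    """Build a one-line email; only the sender varies."""
    msg = EmailMessage()
    msg["From"] = sender
    msg.set_content("I love you!")
    return msg

partner = make_email("partner@example.com")
acquaintance = make_email("acquaintance@example.com")

# The payloads are indistinguishable; interpretation hinges on "From".
print(partner.get_content() == acquaintance.get_content())  # True
print(partner["From"] == acquaintance["From"])              # False
```

Nothing in the content distinguishes the declarations of love; the originator is pure metadata, yet it carries most of the meaning.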

    Curiously, in the long term humans tend to remember important ideas and the emotions attached to them, but rarely their trigger. Even when the information source is relevant for credibility, over time our recollection tends to reinforce concepts without retaining contextual information. The mind weakens the memories of source and eventually assumes credibility of content. Even a story you were initially unsure about in an obscure Wikipedia page soon becomes a fact “you read about somewhere” over time, and only when the information is questioned by a confident opponent does the source again become relevant in the mind.

    Misinformation spreads easily because of our limited capacity to communicate publicly and our consequent inclination to trust unchallenged information. Yet without implicit trust we would be eternal skeptics, unable to live amid the web of assumptions that makes civilization possible.

    Understanding the originator of a message at the moment it is interpreted yields more accurate mental context for the knowledge being stored. Politics offers a clear example: competing factions strive to discredit the candidate rather than the candidate’s specifically disagreeable viewpoints. When a source is deemed untrustworthy, its further communication is weakened. A source deemed trustworthy, especially one whose interpretations remain uncontested over time, has considerably more power in subsequent communication.

    The ability to verify the identity of the person you are talking to, or more generally the source of your information, is important in all communication technologies. Advances in forgery techniques are continually met with counter-measures from governments and watchdog groups, spurred by societal pressure to utilize the full power of the network good that is an information technology. At the time of this writing it’s possible to spoof an email address and convincingly pretend to be another individual through text-based communication. It’s not quite as easy to mimic a person’s voice or visage, however, so these forms of communication carry an additional element of implicit trust.

    In the near future voices and faces will be easy to recreate and even fabricate from scratch. Biometrics and more detailed data regarding our identity (DNA, for example) seem the most likely next steps in the short-term, but I am uncertain how we will retain individual identity in a completely virtual environment of the future Internet. I suppose by the same token I can’t assume that there will still be value in individuality for such an interconnected future. Perhaps knowing the source of communication as obviously as the content of communication is societal self-awareness.