Domain Literacy Updates

This is a listing of updates I've made to this project. It is meant to be a journal of what I've done, published here so that I can remember what I did and what I was thinking. I am sorting this in reverse chronological order, so anyone can start with the latest update and scroll down to explore what has been done. The Atom feed for the project is in chronological order, allowing anyone to receive the latest updates.


The Power Of The Web Domain

Domain literacy is one of my new causes in 2017. We'll see how successful I am convincing people of what domain literacy is, what the benefits are, and the power of the domain at this point in time. Web domains like Twitter.com and Facebook.com are influencing the world in ways we haven't even imagined. 

Think about the power wielded by Twitter and Facebook right now. The collective energy they wield to shift markets, influence opinion, and change how the world works. Or, do they? They only do because we operate within their domain. We go to Twitter or Facebook.com, and have them in our pockets via mobile applications. We give them this power. What if we chose not to? What if we did not use Twitter.com or Facebook.com? 

With each message, image, video, and like we've incrementally given power to these domains, and in turn given power to those who have figured out how to dominate within these domains through the replication, automation, and application of their ideology. We tend to mistake this for some Internet-enabled democracy, where in reality those with the know-how, and compute power can out-blast, troll, and silence those who do not. 

I'm fascinated by the power we've given to these web domains in just a decade. How much they've captured our attention, and how we've accepted them as the way things are, and something that is inevitable. While I do not think we can fully escape the powerful effects of these domains, I do think that we can maintain our own domains, frequent and support domains that we trust and believe in, and minimize the power we bestow to some of the domains that are destabilizing our world right now.


Some Benefits Of Basic Domain Literacy

I am further evolving what I mean when I say "domain literacy", and I wanted to brush up on what the benefits of domain literacy are, which is why this is an area I'm focusing more attention on in 2017. It is important to me to be able to articulate what I mean by domain literacy because it is a potentially complex, multi-dimensional concept that impacts our physical as well as our digital worlds.

While there is no antidote for everything that ails us on the web these days, I feel like domain literacy brings some interesting benefits to the table that can help protect the average person from two of the most dangerous things that we face on the web right now:

  1. Phishing - If you are domain literate, you will always right click on ANY link in an email to check where it really points before clicking, closing up the number one way that hackers and cyber(in)security specialists get into secured networks and systems.
  2. Disinformation - If you are domain literate, you will know every share on Facebook links to a domain, and understand that there is wider context beyond the catchy title, description, and image of the Facebook "card" being shared. 
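
The check behind both habits is the same: compare what a link says with the domain it actually points to. Here is a minimal Python sketch of that idea using only the standard library. The names LinkCollector and link_hosts are made up for illustration, and the message and domains are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (visible text, href) pairs for every <a> tag in an HTML message."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def link_hosts(html):
    """Return (visible text, host the link really points to) for each link."""
    collector = LinkCollector()
    collector.feed(html)
    return [(text, urlparse(href).hostname) for text, href in collector.links]


# Invented example: the text claims one domain, the href goes to another.
message = '<p><a href="http://login.example.net/reset">www.mybank.example</a></p>'
print(link_hosts(message))
# [('www.mybank.example', 'login.example.net')]
```

When the two values differ, that is the moment to stop and look at who is behind the second domain before clicking or sharing.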

There are many challenges with ensuring a large percentage of the population meets a baseline domain literacy, and there will be other benefits beyond these, but I feel like we have to begin somewhere. These are two of the most important tools in any cyber(in)security specialist's toolbox. They are how the average person is compromised digitally and misled emotionally, resulting in a very malleable, and exploitable individual.

I am not suggesting that everyone should be fully aware of each web domain behind everything they use and share daily, or the inner workings of DNS. I am just suggesting that we draw a baseline for what domain literacy is, and identify what we'd like to accomplish with this definition. If we can help enough folks meet a baseline definition of domain literacy, I can't help but think we could shift the current cyber(in)security environment significantly and make for an incrementally healthier online environment for everybody.


What I Mean When I Say Domain Literacy

Online web domains are an increasingly important aspect of our daily business and personal lives. I get that the average folk couldn't care less about domains, DNS, and the nuts and bolts of how the web works, but after this election, and as more of our personal and professional lives move online, I feel like folks ignore some of the deeper details at their own peril. In 2016, you either work on someone else's farm (domain), or you work on your own, and increasingly folks are operating their personal lives and business worlds entirely on someone else's domain.

I do not expect folks to understand domains at a very technical level, but I'm working to develop a baseline expectation of what people should know, and evolve this into a coherent definition of what domain literacy means to me. Domain literacy for me means that an average citizen should have the following awareness:

  • Domains Exist - A basic knowledge that domains exist, that there are many different top level domains, and the ability to look at the address bar in their browser and make sense of the domain they are operating within.
  • Domain Due Diligence - Have a basic awareness that there are different entities behind domains, ranging from individuals to corporate, institutional, and government-owned domains. Ideally they also have basic knowledge of how to conduct a little due diligence to understand who is behind a domain (e.g. Whois, Business Search).
  • Domain Experts - Understand which domains are the go-to when it comes to finding domain experts. Ok, this is where the meaning starts to morph and bend, but I think it contributes to the depth of domain literacy, and to the importance of critical thinking as part of developing and strengthening domain literacy.
  • Operate Your Own Domain - That it is possible for ANYONE to purchase and operate their own domain on the Internet, and that this means more than just having a blog or an e-commerce site. Ideally, there is a basic understanding of where to purchase and host your domain, even if it is with an existing service provider like WordPress, Wix, or Reclaim Hosting.
  • Reclaim Your Domain - It is important for people to understand that they have control over their accounts and presence in other domains. That they can download their data, access it via APIs, use it in services like Zapier, and even delete their presence within any domain--if they can't, they shouldn't be operating there.
  • Safely Operate In Variety of Domains - Even for those of us who operate heavily within our domains, the reality is that we will always have to operate in 3rd party domains, either mandated by work, our schools, government agencies, or just because it is the popular place to be for fun or business. This is where the average citizen needs a basic level of awareness about operating safely and securely in each domain, whether it is Facebook, Twitter, or your banking applications, while protecting your own best interests.

These are the core elements of domain literacy in my opinion. It may sound like a lot to ask of the average citizen, but I don't think it is much different than the basic security, safety, common sense, and financial literacy required in the real world. You don't walk into shady establishments in the physical world, and hand over your private information to people you don't know or trust--we just need to help make people more aware of the details of doing this in the digital, as well as our physical worlds.

This definition is a work in progress for me. This is the first time I've tried to define it as a simple outline. I will work to keep refining it, and hopefully also provide some basic exercises that people might be able to engage in, to strengthen their awareness when it comes to domains. This stuff will become increasingly important in the future. It will determine whether you are well informed during elections, as well as in control of your finances, the value generated by your own work, and your privacy and safety in an increasingly volatile online environment.


Medium And The Importance Of Maintaining Your Own Domain

I wanted to use the recent news about Medium downsizing as an opportunity to educate folks about the importance of maintaining your own domain. I like Medium. I am not as excited about it as some folks are, but I see enough value there that I make sure it is one of the channels I tend to on a regular basis. However, as I've discussed before, it's important to weigh the pros and cons of how much you depend on 3rd party platforms and services for essential pieces of your online presence--like your blog.

I am always thinking deeply about which online services I adopt, balancing my needs, my budget, and how much control I have over my data, content, and algorithms, while also working to understand the motives of each platform, and the products and services they offer. I find value in operating on Medium and have even showcased some API providers' usage of the platform for their blog presence, and Medium's own approach to delivering their API. However, I've always been skeptical about Medium's viability, motivations, and what the future might hold.

We should not stop playing with new services, and adopting those that add value to what we are trying to accomplish online, but we should always consider how deeply we want to depend on these companies, and be aware that their VC-fueled objectives might not always be in alignment with our own. It is a good time to focus on this topic as we ponder the future of Medium, but I wanted to beat this drum again mainly because of the number of folks who felt they needed to tell me in 2016 that I should move my blog entirely to Medium, without considering the impact to my operations--it is cool man!

I do not condemn folks running their blog on Medium, but at a minimum, you should make sure to set up your own subdomain, otherwise you are handing over all your content, power, and control to Medium. If you are blogging for fun, or just as a side to your career, this might not be a problem, but if you are like me, and depend on your blog to pay your rent, you have to put more thought into where your blog operates. I enjoy the network effect of Medium, but I also enjoy the 5-10K my blog makes each month through sponsorship and content creation--something I have been able to cultivate because I've maintained full control over my operations for seven years now.

Startup centric folks love to push back on this way of thought, as they prefer all of us to be dependent on them, regardless of their objectives, exit strategies, or high risk of failure. I'm perfectly happy to enter into partnership arrangements with platforms that bring value, but I want to make sure I can always get my data in, and my data out, and make sure all public URLs are reachable via a domain I have DNS control over. I'm sorry, it's just good business. In 2016, either you are working on someone else's farm (domain), or you are working on your own, enjoying the fruits of your labor, and profiting from the value you generated on a daily basis.


What Is Fake News?

I spent some time studying the "fake news" problem over the holidays to prepare me for speaking intelligently on the topic. I fired up a bunch of Amazon servers, gathered a bunch of data about what has been going on, but as of this weekend I put a pause on the work, publishing what I had gathered to Github, and set the project on the back burner to simmer for a while.

One of the realizations I had while doing this work was of the limitless depths of what "fake news" can be. I looked at the home pages of 350 disinformation sites this weekend, as I studied the results of this "fake news" harvesting via the Twitter API. One of the central actors in my work this weekend was Fidel Castro, who had just passed away. He was on over 90% of the websites I looked at over the weekend, with a diverse range of positions from the left, to the alt-right--Fidel became the poster child for my "what is fake news" thought process.

In the current online environment, separating out "fake", "propaganda", and "marketing" is nearly impossible. While I do not subscribe to many of the conspiracy theories about leading news and media outlets, I would put out there that we all suffer from a significant lack of trust in these critical institutions. From Russia to Cuba, to Iran, Egypt, Iraq, and on, and on...think about the "fake news" that has been put out there by these outlets. Fake news and disinformation are nothing new--we just have a more automated, real time, and algorithmically controlled edition that exists in everyone's pockets, automobiles, and homes today.

So far, all roads with my fake news and domain literacy work lead me back to my earlier digital literacy work. My time, energy, bandwidth, and compute power should be spent helping folks develop their awareness of the online world emerging around us. I will keep developing my domain literacy codebase for profiling the domains I feed it but will be emphasizing domain literacy over "fake news". Investing my time into quantifying, connecting, and defining who is behind domains, and educating average individuals about owning their own domain, and being more aware of the pros and cons of operating within, sharing, and engaging with other major and minor online domains that exist online today.


The Assholes Are Better Equipped When It Comes To Technology

After running around 30 separate servers for a week, I felt like these resources could be better spent educating people about the digital world around us, not battling the well-equipped, and ever evolving disinformation groups that seem to thrive online. It is no secret that white men are good at the Internet, something that became even more evident with the whole gamergate shitstorm, but is something that has reached alarming levels during this election. After evaluating almost 350 domains engaging in disinformation campaigns, one thing became very clear--white angry men are good at the Internet, and amplifying the (dis)information that supports their view of the world.

From a technical standpoint, I'm perfectly happy mapping out these domains, understanding who is behind (or not) each of these efforts, and the organized search and social efforts behind them. From a human standpoint, I can't keep wading through their hatred and validating that yet another site is spreading hate. Spending my money on compute and storage capacity to map out, define, and quantify this world is not the healthiest and most sustainable way forward for me.

I am better off educating one or two people at a time about being more critical in how we use the Internet. Helping the average individual establish, define, and defend their own domain(s), while also learning to sensibly operate in other leading domains like Facebook, Twitter, YouTube, and beyond. I do not feel like there is a technological fix to get us out of the "fake news" situation. I feel like this needs to be a human solution, and my time is better spent helping contribute to digital literacy, and not engaging with or defining the worst domains in the Internet realms.


I Spent Some Time Evaluating The Fake News Problem

There was a lot of fake news swirling around during the election, and like I do with other ways that technology (APIs) are impacting our world, I wanted to better understand it before I opened my mouth telling people there was or was not a problem here--let alone talk about any possible solution. Building on my friend Mike Caulfield's work, I set out to understand how "fake news" was being shared, compared to "regular news".

I seeded my list of "fake news" domains with just a handful that are making the rounds in the press and kept adding to it, resulting in 327 separate domains I was evaluating as of yesterday when I spun down the servers. I fired up an Amazon server and began pulling all of the URLs that make up these sites, and passing those URLs to Facebook so that I could better understand how these sites were being shared, compared to other more recognized news outlets.

Establishing a list of these domains, and harvesting their sites isn't that hard, but understanding their virality on Facebook and Twitter is more costly, as these platforms are in the business of monetizing this information, which contributes to why this is a problem in the first place. I was able to get around some of the limitations of the Facebook and Twitter APIs by launching many different servers (30 as of yesterday), each of which pulls data from its own unique IP address--something that is proving to be more costly than any value I generated.

I am shutting down the servers, and leaving my research on Github for others to build on. I am just unsure of what would be next, uneasy about spending money I do not have on this, and uneasy about being in the business of generating lists of domains that say this is "good" or this is "bad". The reasoning behind these websites spreading information or disinformation varies, and honestly it is very difficult to draw a line regarding what is good or bad--something I'm not very interested in spending my time doing. I will keep working on evolving the code for this project to keep profiling domains, hopefully providing a fingerprint of healthy and unhealthy behavior.

Honestly, this world is very very toxic. These folks are obviously very scared of people of color, government, god, and much, much more, and I'm not really interested in spending my days wading through this stuff. In my opinion, there is no technological solution to this problem. We can encourage the platforms to filter better, and the advertising networks to cut off the revenue generation, but they will continue. This group of disinformation peddlers are extremely resourceful and will find new domains if they are blacklisted, leverage alternative advertising networks when they are cut-off, and launch new social media accounts when cut-off. In short, it would be technology whack-a-mole, and I'm not interested in playing.

The diversity of the sites I came across presented the biggest challenge, with the only common goal across them being capitalist in nature. Many sites speak to the alt-right, or right, but many were crafty at finding affinity with often left-leaning causes like herbal products, marijuana, aliens, and beyond. They are all very search engine optimization (SEO), and social media marketing (SMM) savvy. They leverage all the top social and SEO services, and are obviously very adept at purchasing new domains, and scaling content generation, cross-posting, and gaming Google, Facebook, and Twitter along the way. You could dedicate your life to counteracting this world, and die having never accomplished your mission.

There is plenty of interesting work to be conducted in this area, but it is all work that will cost money to fire up servers, crunch, store, and process the data. Being an independent operation I just can't afford spending money doing this, and after spending upwards of $500.00 on computing and storage costs, I didn't see any light at the end of the tunnel. I'd be happy to continue indexing domains, or evaluating which social services and advertising networks these disinformation sites are using, but I can't continue doing it without any funding assistance. I have better things to spend my time and money on, that provide a measurable impact on digital literacy.

In the end, we have to focus on a more digitally literate society. People who are willing to question who is behind any news item they share. Who individually is responsible for any piece of information, as well as which companies, government agencies, and ideologies exist behind anything being shared. I just do not feel like technology can get us out of this mess. It is humans and educated humans that can get us out of this quagmire. A lack of education is why people voted for Donald Trump, and it is why they believe in fake news, propaganda, and disinformation. No amount of filters, or domain black or white lists will make that better--we need people to be curious, inquisitive, and to want to understand what is behind what they share.

I'm hoping others will continue to work on other data projects and tooling to help push back on this problem. I want all of us to push back on Facebook and Google to help provide solutions. Ultimately I do not hold out much hope because it is the advertising driven incentive model that will drive this. The clueless white dudes behind these efforts, like the guy who NPR found to be behind the Denver Guardian fake news site, are the problem...they don't care about left or right, they care about making money. This is why fake news, cybersecurity, and any other cesspool of the Internet will keep bubbling up and burning all of us--until we address this incentive model for building out the Internet, very little will change.


WHOIS Lookup For Each Fake News Domain

This work is all about domain literacy, so I needed to profile each of the domains included in this project. I am using the WhoisXMLAPI to pull the details behind each domain. It's not a surprise, but all of the news domains provide publicly available details on the individuals and companies behind them, whereas about 90% of the fake and propaganda news sites protect this information.

More manual work will be needed to profile each of the news sites I'm targeting. Pulling WHOIS information just represents what I could do automatically. I'm only publishing the name, organization, city, state, and country for each domain--if it's present. I'm unsure where I will go with this area of the research next. I want to show more details about who is behind each news site and include them in the available "domain literacy" details, but will need to think about it further--I am guessing much of this work will be manual.
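
To make the automated part concrete, here is a rough Python sketch. It is not the WhoisXMLAPI integration this project uses. It swaps in a raw WHOIS query over port 43 (the protocol described in RFC 3912) plus a small parser. The field labels are an assumption based on the common gTLD output format, and the privacy hints are a guessed list, so both would need tuning against real responses.

```python
import socket

# Assumed field labels; WHOIS output is not standardized across registries.
FIELDS = {
    "Registrant Name": "name",
    "Registrant Organization": "organization",
    "Registrant City": "city",
    "Registrant State/Province": "state",
    "Registrant Country": "country",
}
# Guessed markers of a privacy or proxy service.
PRIVACY_HINTS = ("privacy", "redacted", "proxy", "whoisguard")


def whois_query(domain, server="whois.iana.org", timeout=10):
    """Send a raw WHOIS query and return the text response."""
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def profile(whois_text):
    """Pull the registrant fields out of a WHOIS response and flag privacy services."""
    record = {}
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        field = FIELDS.get(key.strip())
        if sep and field and value.strip():
            record.setdefault(field, value.strip())
    joined = " ".join(record.values()).lower()
    # Treat a record with no registrant details at all as protected too.
    record["protected"] = not record or any(hint in joined for hint in PRIVACY_HINTS)
    return record
```

A record that comes back with only "protected" set to True carries the same signal described above: nobody is publicly standing behind the domain.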

My goal is to provide enough data, so that some sort of fingerprint can be established for real news vs. propaganda and fake news. There are many shades to this discussion, and ultimately I want to leave it up to each individual toolmaker who builds upon this data, and the end-user of these tools to make the decision for themselves. I'm not in the business of identifying fake news, just using APIs to provide more details about each domain, and what they are up to.


Discover New Propaganda Domains Using Twitter

Right now I am just manually adding new domains to this tracking system. I wanted a way to dynamically discover potentially new domains that are spreading propaganda. The way the Facebook platform is set up, there isn't an easy way to discover news that is being submitted and shared, so I turned to the Twitter API to see what I could do--there had to be a way to find other fake news domains via the much more public Twitter.

I created an automated job that would take any of the 50+ fake news domains I'm targeting and search using the Twitter search API. This returns the top 100 Tweets that contained a URL using that domain. I'm processing the results and recording each of the Twitter users who are behind these Tweets and working on a new automated job to pull their top tweets and see if there are any new domains to be discovered there.
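
The discovery step can be sketched in a few lines of Python. This is an illustration rather than the job itself. It assumes the tweets are already loaded as dictionaries shaped like the Twitter search API's JSON, with links under entities.urls[].expanded_url, and the function names and domains are invented.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Lower-cased host of a URL with any leading 'www.' removed."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def discover_domains(tweets, tracked):
    """Count domains linked from tweets that are not already being tracked."""
    tracked = {domain.lower() for domain in tracked}
    found = Counter()
    for tweet in tweets:
        for link in tweet.get("entities", {}).get("urls", []):
            host = host_of(link.get("expanded_url", ""))
            if host and host not in tracked:
                found[host] += 1
    # Most frequently linked new domains first.
    return found.most_common()
```

The output is only a candidate list. Each new domain still needs a human look before it is categorized as propaganda, news, or something else.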

Many of the most popular Tweets around these fake news domains are central in spreading this news on Twitter and via Facebook. For now, I'm just adding new URLs to a list, and manually looking through them on a regular basis. I'm categorizing them into propaganda, news, and some other categories for possible future evaluation, or inclusion as part of the URL and graph harvesting process.

I am only pulling the tweets and new domains from the accounts who have a high number of followers and retweets. These are usually the Twitter accounts associated with the fake news domains I'm targeting or similar sites. I am not doing this step for the regular news sites, as it isn't too difficult to find new news outlets, whereas the fake news sites are much more difficult to uncover and identify.


Publish Data Regularly To Github

I have automated jobs set up to regularly publish data for all domains being targeted across the news and propaganda sides of the discussion in the JSON format to Github. I am also updating the URLs that are indexed for each domain, including the latest Facebook shares (if they are pulled). 

All data is published to Github using the Github API. I am also publishing an HTML listing of news and propaganda, and details pages for each domain, allowing the URLs and Facebook share counts to be explored without having to wade through the JSON. It is all available as a single Github repository, allowing it to be downloaded, forked, or directly integrated using the raw JSON files.
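
For anyone curious what a publish step like this can look like, here is a hedged Python sketch built on the GitHub REST API's "create or update file contents" endpoint. The owner, repository, path, and token are placeholders, and this is not the project's actual publishing code.

```python
import base64
import json
import urllib.request


def contents_request(owner, repo, path, data, message, sha=None):
    """Build the URL and JSON body for a 'create or update file contents' call."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    text = json.dumps(data, indent=2, sort_keys=True)
    body = {
        "message": message,
        # The API expects the file content base64 encoded.
        "content": base64.b64encode(text.encode("utf-8")).decode("ascii"),
    }
    if sha is not None:
        # GitHub requires the current blob SHA when overwriting an existing file.
        body["sha"] = sha
    return url, body


def publish(url, body, token):
    """Send the request with PUT; `token` is a personal access token."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the request building separate from the network call makes it easy to inspect what would be written before anything is pushed to the repository.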

I'm trying to make sure all data is updated for each domain at least once a day. Once again I'm limited by the Github API, as well as pulling URLs, and Facebook Graph data. It is important to me that all data is available openly on Github as machine-readable data so that anyone can integrate into their own work.

Having everything on Github also opens up the opportunities for accepting pull requests, adding, and updating data beyond what I can do on my own. I am also leveraging Github issues for the repository to manage the roadmap, and feedback around the project. If the number of domains grows beyond a specific size, I will begin to spread across multiple Github repositories broken down alphabetically.


Updating Totals for Each Domain Daily

To better understand how things are working, as well as the scope of each domain I'm profiling, I wanted to regularly update the totals for the number of URLs indexed, and how many of them have had their Facebook shares pulled. I run this hourly so that I can get an updated number of URLs and shares on a regular basis (daily at least).

If the numbers work out, I should be able to pull the Facebook shares weekly, and eventually identify trending numbers for each URL, across all the domains. Eventually, I'll establish a way to stop pulling for URLs that aren't trending, do not have any shares, and are just too old, or not relevant--it will just take time and refinement.
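
One simple way to express "trending" and "stop pulling" is sketched below in Python. The rule and the 30-day cutoff are assumptions for illustration, since no rule has been settled yet, and the function names are hypothetical.

```python
def share_trend(snapshots):
    """Average change in share count per day between the first and last snapshot.

    `snapshots` is a list of (day_number, share_count) pairs for one URL.
    """
    if len(snapshots) < 2:
        return 0.0
    ordered = sorted(snapshots)
    (first_day, first_count), (last_day, last_count) = ordered[0], ordered[-1]
    if last_day == first_day:
        return 0.0
    return (last_count - first_count) / (last_day - first_day)


def should_keep_pulling(snapshots, today, max_age_days=30):
    """Keep a URL in rotation only while it is recent and still gaining shares."""
    if not snapshots:
        return True  # never pulled yet, so pull it at least once
    first_seen = min(day for day, _ in snapshots)
    too_old = today - first_seen > max_age_days
    # With a single snapshot there is no trend yet, so pull again to get one.
    return not too_old and (len(snapshots) < 2 or share_trend(snapshots) > 0)
```

A URL that went from 100 to 240 shares over a week has a trend of 20 shares a day and stays in rotation, while a URL stuck at zero shares drops out after its second pull.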

These totals for each domain will drive a reporting tool that I have planned. I want to easily allow for comparison of one or many domains across both the news and propaganda repositories. Eventually, I'll develop a variety of visual tools to help make sense of the data being pulled across all domains targeted by the system.


Managing The Jobs That Drive Everything

I am pulling the URLs for each domain, and the number of times each URL has been shared on Facebook. I have 16 separate servers running with separate CRON jobs to support all of this work. I have 147 separate jobs running at any point in time to accomplish what I'm looking to do. Two separate jobs are added for each domain I add into the system.

If I go beyond 6 domains being pulled by any of the 15 Facebook Graph servers, the jobs only run every 10, 20, or 30 minutes, with the default being every five minutes, when there are only five domains. The more servers I can add, the more I can scale this "horizontally", and speed up the process. The adding of each server is currently manual, but once the IP address is added to the system, it will immediately be assigned jobs. 

In addition to indexing the URLs, and pulling from the FB Graph API, I have a number of other jobs running to clean up data and make it available for reporting on in real-time. Right now things are elastic and automated. I can add new domains, and servers, and it will just keep chugging along pulling the data it needs. The jobs are self-creating--meaning the schedule can be deleted, and it will rebuild itself, adding and removing jobs as it needs to scale and maximize the pulling of data.
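
The scheduling idea can be sketched as follows in Python. The exact thresholds for stepping from 5 to 10, 20, or 30 minutes are not spelled out here, so the steps of five domains are an assumption, as are the function names and the job command.

```python
def assign_domains(domains, servers):
    """Spread domains across servers round-robin so load stays roughly even."""
    plan = {server: [] for server in servers}
    for index, domain in enumerate(sorted(domains)):
        plan[servers[index % len(servers)]].append(domain)
    return plan


def interval_minutes(domain_count, per_step=5):
    """Run every 5 minutes up to `per_step` domains, then back off to 10, 20, 30."""
    steps = [5, 10, 20, 30]
    overloaded_by = max(0, (domain_count - 1) // per_step)
    return steps[min(overloaded_by, len(steps) - 1)]


def cron_line(minutes, command):
    """Render a crontab entry that runs `command` every `minutes` minutes."""
    return f"*/{minutes} * * * * {command}"
```

Because the whole plan is computed from the current list of domains and servers, the crontab can be thrown away and regenerated at any time, which is what makes the jobs self-creating.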


Discovering The Facebook Graph For Each URL

Next, I wanted to understand how popular each of the URLs was by evaluating the number of times it has been shared on Facebook. I've spent a great deal of time in the Facebook developer area, looking through the Facebook Atlas, Graph, and Marketing APIs, and can't find any source for evaluating trending URLs, allowing me to discover things by keyword, and number of shares--maybe I'm missing something, and I'll keep looking, but nothing stands out as a solution to me currently.

I can only pass a URL to the Facebook Graph API and get back the number of shares, and I can only make a total of 4800 calls per 24-hour period, per IP address. The pulling of URLs for the 70+ domains can be scaled vertically--meaning I can just purchase a larger Amazon EC2 instance, and scale up the number of calls I make using a single server. When pulling Facebook share details for each URL, I am only processing between 3-5 domains, pulling every 5 minutes, per single Amazon server--each of which has its own IP address.

Currently, I have 15 separate server instances pulling the Facebook Graph information for each URL, across the 70 domains. As I can afford it I will scale this up or down. Unfortunately defining the sharing graph for each domain, across all its URLs, is costly from a compute standpoint. I'm also having to balance the pulling of new URLs vs. the pulling of an update for each URL in an attempt to keep track of which URLs are trending upwards or downwards.
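
The per-IP limit turns into simple arithmetic. This small Python sketch works through it; the 4800 figure comes from the limit quoted above, while the 100,000 URL count in the example is a made-up number, not a measurement from this project.

```python
import math

CALLS_PER_IP_PER_DAY = 4800  # the per-IP limit quoted above


def seconds_between_calls(limit=CALLS_PER_IP_PER_DAY):
    """Minimum spacing between calls that stays under the daily limit."""
    return 24 * 60 * 60 / limit


def servers_needed(url_count, refresh_days, limit=CALLS_PER_IP_PER_DAY):
    """How many IP addresses it takes to refresh every URL once per `refresh_days`."""
    return math.ceil(url_count / (limit * refresh_days))


print(seconds_between_calls())    # 18.0
print(servers_needed(100_000, 1)) # 21
```

One server can make a call every 18 seconds at best, so refreshing a hypothetical 100,000 URLs daily would take 21 separate IP addresses, which is why the share counts are the expensive part.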

I'm pulling the sharing statistics for URLs as fast as I can, with the Facebook API rate limits and resources to pay for Amazon EC2 instances my only bottlenecks. I'll keep pulling and prioritizing this portion of the process in real-time, 24 hours a day. For now it is running at capacity.


Pulling News and Propaganda

Similar to what Mike Caulfield did in his look at the news vs. fake news, I wanted to try and define what is propaganda by comparing it to what regular news outlets were publishing. I put together a list of over 50 "fake news" sites, and around 20 of the leading news sites. In my work I am not looking to algorithmically define fake news vs real news, I am just trying to establish a fingerprint of the domains behind each of these "news" outlets.

I set up a server, running a script that slowly pulls the URLs from the 70 domains I've targeted. For each page it pulls, it parses all links available on the page, adds them to a database, and repeats the process over and over. I do not pull any external links that exist outside each of the targeted domains, just focusing on the outline, structure, and content of each of the news and propaganda sites included.

In an effort to not behave like a denial of service attack, I only pull and process a URL every 5 minutes. I repeat this for all URLs targeted 24 hours a day. Some of the websites have just a couple hundred pages, but others are already in the 20 to 50K range and growing. I'll keep pulling URLs, adding new URLs, and scaling the compute and storage capacity as needed, and as I can afford.
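
The crawl described above can be sketched with the Python standard library. This is an illustration under assumptions, not the script that runs on the server: the function names are invented, an in-memory set stands in for the database, and the fetch function is passed in so the sketch does not hard-code how pages are downloaded.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _Hrefs(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_url, html):
    """Return the absolute, same-host links found on a page, de-duplicated in order."""
    host = urlparse(page_url).hostname
    parser = _Hrefs()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).hostname == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def crawl(start_url, fetch, delay_seconds=300, max_pages=10, sleep=time.sleep):
    """Breadth-first crawl of one site, pausing between pages to stay polite."""
    queue, seen, pages = [start_url], {start_url}, 0
    while queue and pages < max_pages:
        url = queue.pop(0)
        for link in internal_links(url, fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        pages += 1
        if queue and pages < max_pages:
            sleep(delay_seconds)  # 300 seconds matches the 5 minute pace above
    return seen
```

External links, mailto links, and fragments are dropped, so the set that comes back is an outline of the one site being profiled.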


The Problem Of Fake News

The problem of fake news bothered me throughout the election, with friends and family regularly sharing stories that were complete bullshit, demonstrating very little awareness about the origin and truth of what they were sharing. I feel this is a problem that Facebook and Twitter should work to address, but there is only so much blame we can throw in this direction. At some point we need to step up and be the change we want to see as well.

Being intimate with the business models of Twitter and Facebook, I'm just not convinced the current incentive model will result in them doing anything about the "fake news", or as I prefer to call it, the "real propaganda" problem. The web (Google) is built using advertising, and social media (Twitter & Facebook) have followed this playbook--making all of this about clicks, views, and shares--something they just will not be that interested in changing. Even if they say they are fixing it, I'm unsure they'll truly do what they say.

While we should be pushing back on these platforms to do better at filtering out "fake news" and "real propaganda", as well as cutting off access to their advertising networks for these sites, I wanted to do more to understand just exactly what was going on. To kick things off, I got to work pulling together a list of some of the common propaganda sites and spun up an Amazon EC2 server to get to work pulling all the URLs from each domain, to better understand how these sites are operating, as well as how their content is being shared.

Everything is available on Github, and via what I'm calling domain literacy.


You can find what is next on the Github issues for this project, where I am publishing any bugs, enhancements, and other items I am working on to push this project forward.