From the Hands of David Yanofsky

May 19 4:23am

Chartbuilder 0.6

For the last month or so reporters at Quartz have been building charts with the bleeding edge version of Chartbuilder. Have you noticed? Didn’t think so.

Last night I was able to merge these changes into the master branch—it includes some long long long outstanding pull requests and bug fixes.

You can try it out using the hosted version of Chartbuilder, but as always, Chartbuilder really sings only after you customize it and host it yourself.

The version number is 0.6 as I still can’t get myself to bump it to 1.0 until it has proper documentation, a complete suite of unit tests, a less-than-utilitarian interface and fewer situations where the whole thing breaks down.

I hope you find these changes helpful. As you come across bugs or desire more functionality, please let me know by submitting them on github, or by email, twitter, or bike messenger. Better yet, take a stab at fixing them yourself, or at any of the other open issues; there are currently 34 waiting to be fixed.

Here are the ones that were just closed:

Bug Fixes:

  • Prevent series colors from changing when the type of series is changed
  • Prevent the plague of the skinny columns
  • Fixed strange behavior with titles of bargrids
  • Auto-titling a chart with only one series works again

Enhancements:

  • the input accepts numbers that have $, £, €, or % on them (see the sketch after this list)
  • the input accepts numbers with your region’s decimal and thousands separators, i.e. 1,300.10 is valid on US-localed machines and 1 300,10 is valid on France-localed machines (thanks Parker Shelton)
  • the input accepts excel error cells e.g. #N/A (and plots them as blanks)
  • there are lots more semicolons in the code
  • the automatic date axis format is much much better (thanks Parker Shelton)
  • you can take a regular date series and format it as quarters
  • Up to 10 y-axis ticks by default
  • font loading to support https (thanks to Imran Nathani)
  • Bower is completely integrated, with installation instructions for using it (thanks to Alan Palazzolo)
  • the HTML table output uses the number formats of a user’s locale
  • Inline styling isn’t overwritten on save if there is no new rule. (thanks to Alan Palazzolo)
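
For a flavor of what the input now tolerates, here’s the idea sketched in Python. Chartbuilder itself is JavaScript, so this is only an illustration of the normalizing step, not its actual code; the function name and separator arguments are mine:

import re

# Illustration only (Chartbuilder is JavaScript; this is not its code).
# Strip currency/percent symbols, handle locale separators, and treat
# Excel error cells as blanks.
def parse_cell(raw, thousands=",", decimal="."):
    s = raw.strip()
    if s.startswith("#"):  # e.g. #N/A plots as a blank
        return None
    s = re.sub("[$£€%]", "", s)
    s = s.replace(thousands, "").replace(decimal, ".")
    return float(s)

print(parse_cell("$1,300.10"))                             # 1300.1
print(parse_cell("1 300,10", thousands=" ", decimal=","))  # 1300.1
print(parse_cell("#N/A"))                                  # None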

August 25 5:29pm

Changing the author of track changes comments in Word

A friend of mine needed to submit a Word document with track changes for school, but she needed it to be anonymized. The default behavior of Word is to attach the name listed in Word’s preferences to every change made while track changes is turned on. However, if you change your name in the preferences, all previous edits still remain under the old name in the document and are not editable.

With a series of command line commands you can change all the track changes in a Word document to any name you need, because the docx format is actually just a zip file of XML documents that contain all of the Word doc’s content and metadata. (It’s specified here.)

Here’s the code you can run in your Mac’s terminal to change all of the track changes author names; it assumes the file you’re editing is on your Desktop:

cd ~/Desktop
unzip myDocument.docx -d anonDocument/
grep -rl "w:author" ./anonDocument | xargs sed -i '' 's/w:author="[^"]*"/w:author="anonymous edit"/g'
cd anonDocument
zip -r ../cleanDocument.docx .
cd ..
rm -r anonDocument
open cleanDocument.docx

This is what that does:

  1. change the working directory to the Desktop
  2. unzip the word doc into a new directory called anonDocument
  3. search all of the files in the word doc package and replace any comment or track changes author with anonymous edit
  4. change the working directory to the anonDocument directory
  5. create a new word doc on the Desktop called cleanDocument.docx
  6. change the working directory back to the Desktop
  7. remove unzipped document folder from the desktop
  8. open the new document

The code is on github, as well.
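
If you’d rather skip the shell, the same idea ports to a few lines of Python, since a docx is just a zip of XML. This is a sketch of the approach, not the script on github:

import re
import zipfile

# Sketch: rewrite every w:author attribute inside the docx's XML parts.
# The filenames and the replacement name mirror the shell version above.
src, dst = "myDocument.docx", "cleanDocument.docx"

with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
    for item in zin.infolist():
        data = zin.read(item.filename)
        if item.filename.endswith(".xml"):
            data = re.sub(b'w:author="[^"]*"', b'w:author="anonymous edit"', data)
        zout.writestr(item, data)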

August 17 11:27pm

That was terrifying, exhilarating, and distracting

I released Chartbuilder last week not knowing what to expect. I had thrown up some gists in the past, but I’d never open-sourced an ongoing project.

My fear was that it would be shrugged off with a whimper of notice. I was worried that people I admired in the industry would see my efforts and think of them as trivial.

I still view Chartbuilder as a simple solution to a simple problem; why shouldn’t they think so too? (I also wrestle with the “I’m not a developer” complex described perfectly by Noah Veltman)

I wrote up a piece about how the tool has made Quartz better, and the internet–or at least the little corner of the internet that I operate in–went nuts.

Turns out, lots of people in news and elsewhere have been dying for a way to easily create and export charts as images.

Here are some people who I’ve never met (well, with the exception of the one who I went to middle school with…TRIVIA!) whose work and opinions I admire and respect saying nice things about Chartbuilder: 

Chartbuilder is already being used in other newsrooms, and has gained fabulous contributors on github.


Now that it’s open source, the version of Chartbuilder today is significantly better than the version that only existed on Quartz servers last month.

Terrifying, exhilarating, and distracting. Also, incredibly fun.

July 30 3:14pm

quartzthings:

We’ve just open-sourced Chartbuilder, the tool that all reporters use at Quartz to quickly make simple charts at graphics-desk quality. Read more about how Chartbuilder came to be and how we use it in David Yanofsky’s piece for the Nieman Journalism Lab.


(via quartz)

March 29 10:06am

theannotationlayer:

A couple of hours ago, I was telling a colleague of mine over beer how good this Periscopic graphic about gun deaths is. If you haven’t seen it, you should check it out right now. The way that the dots (people) just drop off of their potential lifespans, and how, once the animation gets up to full speed, the whole thing looks like a machine gun firing…it’s super affecting.

But I’m starting to question the editorial judgement a little bit. I took another look at the graphic tonight after finding it to share the link with my colleague. I hadn’t actually realized that you can click on any one of those lines—which, of course, represent real individuals—and be taken to the news story about the corresponding person’s death.

After filtering out all but the deaths in the past seven days, I found and clicked on one that had taken place in my own borough of Brooklyn. Apparently, the victim had stabbed somebody, and then lunged with his knife at the cops who arrived on the scene. The cops ended up shooting and killing him. 

I’m not sure that including gun deaths like this one in the graphic was a sound decision. Clearly the graphic was intended to inform the debate about gun regulation in the US. It was published when Sandy Hook was very fresh in everyone’s mind and Wayne LaPierre was on TV almost every day.

So, in addition to tacitly arguing for tighter gun control, is it also arguing that police officers shouldn’t have guns? And, is it really fair to say that someone who gets shot after threatening a group of cops with a knife has had his life stolen from him? He played some role in his demise, no?

Obviously, I have no idea what actually happened that night. The cops could have been trigger-happy or bigoted or just a bunch of dumbasses. Maybe they did fire without cause and maybe they did steal a life. I’m not sure.

But the point is that Periscopic isn’t either. They made the decision to include all gun deaths and to declare the consequent lost years of the victim “stolen” regardless of who fired the gun and whether or not it was self-defense.

And I understand—there are a lot of gun deaths in this country, unfortunately, and going through every individual death probably isn’t all that feasible for the Periscopic team.

But if you’re going to take on a project this ambitious and important, I think that you should do your best not to be misleading. A simple way to do that would be to not include cases where a cop was the shooter. Surely, police officers have caused a slew of unnecessary gun deaths. But save that injustice for a different graphic.

January 19 2:27pm

A final fiscal cliff wall story

A friend of mine was at a party last night. The hosts pulled her aside and said,

Lauren, we need to talk.

They take her into the bathroom.

"For the last month we’ve looked out this window in our shower and saw this big ‘No’ taped on that wall over there…we couldn’t figure out what it was or why it would be there. We were obsessed. We would try to figure it out all the time. But now it’s gone.

The other day we were on Facebook and we saw that you liked a picture of the wall!

WHAT IS THIS WALL?

http://www.hastheusgoneoffthefiscalcliff.com/

January 10 5:48pm

How we built hastheusgoneoffthefiscalcliff.com

quartzthings:

Today we put hastheusgoneoffthefiscalcliff.com to sleep after a month of service, so we wanted to explain how it came to be.

How did the site work?

We had a DSLR plugged into an AC power supply, on a tripod, hooked up to a Mac Mini with a USB cable:

[image: photo of the camera setup]

  1. The Mac Mini ran a bash script every 5 minutes through the crontab (a sketch of this job follows the list)
  2. the script triggered a camera capture through the USB cable and downloaded the image
  3. the script created two smaller resized copies of the image (one for the site, one for social media use)
  4. the script uploaded those images to our web server, replacing the previous captures
  5. the script put a timestamp in the full sized image’s filename and moved it to an archive on the Mac Mini (for posterity)
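
In code, the job cron kicked off looked something like this. The original was a bash script; this Python version is a hypothetical reconstruction, and the gphoto2 trigger, filenames, and server address are all my assumptions:

import os
import shutil
import subprocess
import time

# Hypothetical reconstruction of the capture job (the original was a bash
# script); gphoto2, the filenames, and the destination are assumptions.
def capture():
    # trigger the camera over USB and download the frame
    subprocess.check_call(["gphoto2", "--capture-image-and-download",
                           "--filename", "capture.jpg", "--force-overwrite"])
    # two smaller copies: one for the site, one for social media
    subprocess.check_call(["sips", "-Z", "1000", "capture.jpg",
                           "--out", "imagecapture_1000.jpg"])
    subprocess.check_call(["sips", "-Z", "600", "capture.jpg",
                           "--out", "imagecapture_600.jpg"])
    # replace the previous captures on the web server
    subprocess.check_call(["scp", "imagecapture_1000.jpg", "imagecapture_600.jpg",
                           "user@example.com:/var/www/cliffcam/"])
    # archive the full-size image with a timestamp, for posterity
    stamp = time.strftime("%Y%m%d-%H%M%S")
    shutil.move("capture.jpg", os.path.join("archive", "capture-%s.jpg" % stamp))

if __name__ == "__main__":
    capture()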

The webpage was hard-coded to the location of the image file and had some javascript that would update the image every 2.5 minutes (faster than images were taken, to reduce the lag between what a user might see and what was actually in the office). That script used jQuery and looked like this:

setInterval(function() {
    $("#camimg").attr(
        "src",
        "http://hastheusgoneoffthefiscalcliff.com/imagecapture_1000.jpg?timestamp=" + (new Date()).getTime()
    );
}, 1000 * 60 * 2.5);

This redefined the path to the image every 2.5 minutes, appending the timestamp as a parameter to make sure we didn’t get a cached version of the image.

The clickable areas were defined in a Google spreadsheet that was loaded in every time the page loaded and on each subsequent image replacement. We updated this document by hand every time we changed the wall.

How did the wall work?

The hard way: Every morning we got up, printed out some headlines, tweets, quotes and pictures, tiled them together and taped them to the wall.

Why??!

The idea for the single-serving site was Zach’s from the beginning. Sometime in October he noticed that the hastheusgoneoffthefiscalcliff.com domain was available to register, and he brought it up as something maybe worth pursuing. The first idea he sketched out was this:

[image: sketch of the man-and-cliff concept]

A man who slowly inches toward the edge of a cliff, paired with links to stories around the web about the topic and an explanation of what the fiscal cliff was. That conversation devolved into the merits of different depictions of cliffs:

[image: sketches of different cliffs]

After seeing Brian Rea’s coverage of the US Presidential Election Night on Instagram and remembering the website for Sagmeister and Walsh, I thought about making a webcam of a wall of stuff. I made this as a proof of concept:

[image: proof-of-concept capture of a wall of stuff]

Everyone got on board, and this is what resulted over the next month:

- David Yanofsky

View the code on github

Development time: 2 days

December 17 10:30am

GIFing the 1040 and other notes on hacking the IRS website

quartzthings:

I published a story on Thursday about the complexity of the US tax code over time, using the length of the IRS Form 1040 as a proxy. It led to responses like this one on Facebook:

Imagine the human being who took the time to make this. We must deeply honor the focus of that person.

Apparently it seemed like a crazy task to filter through almost 100 years of documents and tabulate information about them. Let’s assume they thought I was doing this by hand.

I wasn’t. Not even the GIF. Here’s how:

Making the GIF

There’s a command line tool called ImageMagick that can turn a PDF into a series of images and then turn that series of images into a GIF. These are the two ImageMagick commands I used to accomplish this:

$ for infile in *.pdf; do convert -density 400 -resize 400 -trim -gravity north -extent 500x700 $infile jpeg/$infile.jpg; done
$ convert -delay 25 -loop 0 jpeg/f1040--*-0.jpg animated1040.gif

The first line tells ImageMagick to look at every PDF in the current directory, convert it to a 400px-wide JPEG (rasterizing the vector data at 400ppi), trim the margins, extend the canvas to 500px by 700px (anchoring the image to the top center of the new bounds), and save it in the folder named jpeg. The second line tells ImageMagick to merge every file in the jpeg folder with a name ending in “-0.jpg” (i.e., the first page of each former PDF) into a GIF called “animated1040.gif” that flips to the next image every 25 hundredths of a second and loops continuously.

After cleaning and optimizing it in Photoshop I had this.

[image: the animated GIF of 1040s]

Finding the files

All of this was dependent on having all of these 1040s. When I started looking for them, I was hoping some think tank or library would have an archive of the documents.

I decided to start simple. The current form is easy to find. A web search for “1040” revealed the PDF served by the IRS as the top result. Now what about the old forms? A web search for “2010 Form 1040” also returned a PDF on the IRS website, but it had a slightly different URL: www.irs.gov/pub/irs-prior/f1040--2010.pdf. “irs-prior” — I like the look of that — “f1040--2010.pdf”. Could all of the filenames be systematized?

Yes! A couple minutes of URL manipulation in my browser allowed me to find that there were files at this URL dating back to 1913 (though there were no forms for 1914 and 1915, since those years used the same 1913 form).

Downloading the docs

The next step was to download all of the files. Should I change the year in each URL and “save as” from my browser? TERRIBLE IDEA. I opened up my command line and used the interactive prompt of Python to download all the files super quick. It went something like this:

$ python
>>> import urllib
>>> years = range(1916,2012)
>>> for y in years:
...     urllib.urlretrieve("http://www.irs.gov/pub/irs-prior/f1040--%s.pdf" % (y,), "f1040--%s.pdf" % (y,))
...

Here’s what that means:

  1. start python
  2. load the library I need to download files
  3. create a list of years that I want to download: start in 1916, end in 2011 (one year before 2012), call it “years”
  4. cycle through every year in that list calling the current year “y”
  5. download the file using the url and naming system I figured out before, save the file using the same system

Two minutes later there were 97 PDFs in my folder for this project. I opened up the 1913 form in my browser and downloaded it. BOOM. Every 1040 ever.

Counting pixels

So now we had all these files, and we had to quantify exactly how much more complex they got over time. My first idea was to use the amount of ink used on each document as a proxy for complexity. I wanted to count the number of black pixels in each document. I used ImageMagick to convert all the PDFs to images and could start counting pixels.

Using a Python library called PIL, I opened up each file with Python, converted it to grayscale, counted the number of black pixels, calculated the ratio of black-to-total pixels, associated that JPEG with the appropriate year, and saved that information as a JSON blob and CSV spreadsheet. I’ll save you from the full code here, but you can see it on github.
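
The core of the approach looks something like this (a simplified sketch, not the actual script):

import csv
import glob
from PIL import Image

# Simplified sketch of the pixel counting (the real script is on github);
# the 128 threshold for "black" is my stand-in.
rows = []
for path in sorted(glob.glob("jpeg/f1040--*-0.jpg")):
    img = Image.open(path).convert("L")        # "L" = 8-bit grayscale
    pixels = list(img.getdata())
    black = sum(1 for p in pixels if p < 128)
    rows.append([path, black, black / float(len(pixels))])

with open("ink.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "black_pixels", "black_ratio"])
    writer.writerows(rows)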

Using the CSV I got out of that, I made this chart showing the amount of “ink” on the form over time:

[image: chart of the amount of “ink” on the form over time]

It was antithetical to what we knew was true. If the amount of printing were a proxy for complexity, this chart would show that the tax code is less complex now than in the first years of the system.

Were the older documents just bigger? Did they use larger type? I charted the same information but as a ratio of the amount of black per page. Same story. Then I realized that the older documents have more instructions on them! (More recent 1040s include instructions in a separate appendix.) What if we just looked at the tabulation page? No luck. Apparently today’s documents are more ink-efficient than those of yesteryear.

Counting lines

I crafted a new strategy: count the number of line items on the form. (Our methodology for what we counted is recounted in the piece.) The slow way to do this would be to double-click each file in a document viewer, count how many lines were on each page, and input that into a spreadsheet, hoping I don’t miss anything or make a typo. (It was beer o’clock in the office.)

The fast way is to write more code. I created another Python script that would open up each page of every document individually and prompt me to enter how many lines were on that page, and whether to overwrite the number of lines already recorded for that document or add to it. Once complete, the script saved a spreadsheet of the recorded information.
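
A stripped-down sketch of that loop (the prompts and recorded fields are stand-ins, not the real script):

import csv
from glob import glob

# Stripped-down sketch of the interactive counting loop (Python 2, like the
# post's other snippets; prompts and fields are stand-ins).
counts = {}
for pdf in sorted(glob("f1040--*.pdf")):
    # (in the real workflow, each page was displayed for inspection here)
    n = int(raw_input("Lines on %s? " % pdf))
    mode = raw_input("[o]verwrite or [a]dd to the running count? ")
    counts[pdf] = n if mode == "o" else counts.get(pdf, 0) + n

with open("line_counts.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "lines"])
    for name in sorted(counts):
        writer.writerow([name, counts[name]])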

I made this chart.

[image: chart of line items on the 1040 over time]

Thinking about the transfer of instructions from the form to a separate document, I decided to take a look at the instructions booklet and see how those have changed over time. I used the same exact scripts as above to download all of the instruction files by changing the URL slightly to “i1040” from “f1040.” (This naming convention was also revealed by a web search for “1992 form 1040 instructions.”)

Counting pages

The recent documents were long: the 2011 booklet is nearly 200 pages. I used some more code to count the number of pages, and it looked like this:

$ python
>>> from pyPdf import PdfFileReader
>>> years = range(1939,2012)
>>> for y in years:
...     print y, PdfFileReader(file("i1040--%s.pdf" % (y,), "rb")).getNumPages()
...

I copied the data from the output (I didn’t save it into a file, for speed’s sake), pasted it into Excel, and made this chart:

[image: chart of instruction booklet page counts over time]

Fifteen years ago, tax instructions were half the size! More striking, the booklets from the ’80s have smaller pages, but were still significantly shorter than today’s.

So now I had a GIF, three charts, and a whole bunch of data. All that was left was words.

Read them all here: Line for line, US income taxes are more complex than ever

-David Yanofsky

View the code on github

Development Time: 1 day

October 15 6:10pm

My latest interactive is a comparison tool for the Center for Global Development’s Commitment to Development Index, a barometer of developed countries’ dedication to helping poorer nations enhance their standing.

The piece is responsive and accompanied by a bunch of words by Tim Fernholz.

It runs on the visualization library d3, making this my second published work to leverage it. (The first was a dashboard for the release of the jobs report, which I will write about soon.) This would not have been possible without Scott Murray’s tutorial.

Inspiration also came from the recently launched State-by-State interactive from the Bloomberg Visual Data group; seeing the HTML and CSS markup in their drop-downs was very instructive.

July 26 6:46pm

That feeling when you weren’t expecting a byline.


(Source: bizweekgraphics)



Appendix

This website was designed and coded in 2013 by David Yanofsky in Brooklyn, New York and Los Angeles, California using Sublime Text. The body text is typeset in Courier Prime, or Courier, or Monaco, or something else–because that's how the internet works.