Real or Fake? (news articles)
Every day we are bombarded with information from more sources than we can count, and it can be challenging to gauge the validity of each piece of information.
In this exploratory project, I decided to look at news articles verified to be true and an equal number of articles verified to be fake. The data is publicly available (link added below), and the newest article entry is from 2017. A lot of work had to be done before I could run any kind of statistical analysis, but I will only discuss the main challenges I encountered.
The first problem was that whole paragraphs are almost impossible to work with, so each one had to be split into individual words and standardized (i.e., all lower case).
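As a rough illustration of that splitting and lowercasing step, here is a minimal sketch using only Python's standard library. The `tokenize` function and the sample sentence are my own placeholders, not the exact code used in the project.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    # Keep runs of letters, allowing one internal apostrophe (e.g. "didn't").
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

# Hypothetical sample sentence, not taken from the data set.
sample = "Breaking News: The Senate didn't vote today!"
print(tokenize(sample))
```

A regular expression is just one way to do this; a natural language library would handle edge cases such as curly apostrophes and hyphenated words more carefully.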
The second hurdle came when trying to run any kind of analysis: punctuation marks, filler words, and repetition are incredibly frustrating. Luckily, many data scientists have come across this issue, so natural language tools were readily available. With these tools and some of my own layman's solutions, I was able to clean up the data. My knowledge is rudimentary, so I chose to perform a simple frequency analysis on words used in real news vs. fake news, and I decided one of the most pleasant ways of portraying the results was these two gorgeous word clouds:
I had no idea what I would find, and only a very general hypothesis: the words would revolve around politics. I should mention that I omitted the articles’ titles as well as specific names. Really just Hillary Clinton and Donald Trump, since variations of these names were equally significant in both real and fake articles and took the #1 and #2 spots on both frequency tables.
It was difficult to come up with any kind of conclusion even with the word clouds. Needless to say, politics is in fact the most common topic, so in that sense I suppose my hypothesis was correct!
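The frequency count behind the word clouds can be sketched roughly like this. It is a simplified version under my own assumptions: the stop word list is a tiny hypothetical stand-in for the one a natural language library would provide, and the token list is made up for the example.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "on"}
# Names left out of the counts, as described above.
OMITTED_NAMES = {"hillary", "clinton", "donald", "trump"}

def word_frequencies(tokens, top_n=10):
    """Count the remaining words after dropping stop words and omitted names."""
    kept = [t for t in tokens if t not in STOP_WORDS and t not in OMITTED_NAMES]
    return Counter(kept).most_common(top_n)

# Made-up tokens standing in for one cleaned article.
tokens = "the senate and the house vote on the budget and trump signs the budget".split()
print(word_frequencies(tokens, top_n=3))
```

Running the same count separately on the real and fake articles gives the two frequency tables that feed the word clouds.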
I did run one additional test which calculated the average number of words used in real and fake articles, but this time I used both the articles’ titles and text, and separated them into their own categories. Here are the results:
Average number of words in title ( real , fake ): ( 6.94 , 7.56 )
Average number of words in text ( real , fake ): ( 471 , 349 )
On average, fake news titles contain about 9% more words than their real news counterparts. Conversely, real news texts contain about 35% more words than their fake news counterparts.
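For anyone who wants to check the arithmetic, here is a small sketch of how those percentages follow from the averages. The helper names `mean_word_count` and `percent_more` are hypothetical, and splitting on whitespace is an assumption about how the words were counted.

```python
def mean_word_count(texts):
    """Average number of whitespace-separated words per text."""
    return sum(len(t.split()) for t in texts) / len(texts)

def percent_more(larger, smaller):
    """How many percent bigger `larger` is than `smaller`."""
    return (larger - smaller) / smaller * 100

# Averages reported above, as (real, fake).
title_real, title_fake = 6.94, 7.56
text_real, text_fake = 471, 349

print(round(percent_more(title_fake, title_real)))  # 9
print(round(percent_more(text_real, text_fake)))    # 35
```

Note that each percentage is relative to the smaller of the two averages, which is why the two comparisons use different baselines.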
Interpreting the results from this study is a bit tricky, mainly because I could not find similar projects to compare my results with, and for this particular topic the age of the data is extremely important.
My next step could be running this analysis on bigger and more recent data sets. Additionally, I could group the data by publication date or by news outlet, and maybe even measure public sentiment by analyzing comments posted online about the articles. Regardless, I will need more data if I hope to validate any findings.
See you next time!
Questions? Just want to chat? Leave a comment!