When I talk to friends, they sometimes have trouble understanding why a given piece of software is important or how it's useful. That's understandable: in the tech industry some of the stacks are quite massive, and new technologies are invented as new issues arise (and new issues arise every day).
This page is just my views & readings on some of these technologies. This page is by no means a tutorial or a guide.
The Big Data category holds the following technologies because they become useful when you have an enormously large dataset. If you need to process a huge dataset, you would reach for one of these. Most of them offer scalability and have a large number of contributors as well.
Focuses on stream processing/complex event processing. Basically, as data comes into the system, it allows for realtime computation on it. At Twitter this is used to process and analyse every Tweet to provide analytics to both users and advertisers. They also use Storm to run their anti-spam and content discovery classifiers.
Another use is to transform unstructured data into structured data. For example, if you wanted to check and correct the spelling of each tweet, you could use Apache Storm for that. The process would be: Tweet comes in -> Check spelling -> Fix spelling -> Send to database -> Display to users. Basically, the idea is that Storm can be used to process streaming data.
Storm processes data one record at a time - meaning that it processes each record before moving on to the next one.
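The spell-checking pipeline above can be sketched in plain Python. This is only an illustration of the one-record-at-a-time model - real Storm topologies are written against Storm's spout/bolt API (typically in Java), and the function names and correction table here are made up:

```python
def tweet_spout():
    """Emits tweets one at a time, playing the role of a Storm spout."""
    for tweet in ["helo world", "storm is fast", "speling matters"]:
        yield tweet

def spell_check_bolt(tweet, corrections):
    """Fixes known misspellings in a single tweet, playing the role of a bolt."""
    return " ".join(corrections.get(word, word) for word in tweet.split())

corrections = {"helo": "hello", "speling": "spelling"}
processed = []
for tweet in tweet_spout():
    # each record flows through the whole pipeline before the next one starts
    fixed = spell_check_bolt(tweet, corrections)
    processed.append(fixed)  # stand-in for "send to database"

print(processed)
```

The key shape to notice is the loop: every record is fully handled before the next record is pulled in, which is the opposite of the batch model described for Spark below.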
If you'd like to know more about why Storm exists then checkout the Rationale of why it was created.
Spark is inherently a data analytics platform. One of its key features is that it's an in-memory platform, meaning working data is kept in RAM rather than written to disk between steps. According to users online it's mostly used for "speeding up batch analysis jobs, iterative machine learning jobs, interactive query and graph processing".
At eBay they use Spark for many things. One example is running Machine Learning jobs in the background using the MLlib library. Other companies use Spark to process their incoming analytics. Spark is also claimed to be up to 100x faster than Apache Hadoop's MapReduce for in-memory workloads.
Spark processes data in batches - meaning that it processes more than one record at a time.
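A batch-style word count shows the contrast with Storm's model. This is plain Python mimicking the shape of a Spark job - real code would use pyspark's SparkContext/RDD API, and the sample data here is invented:

```python
from collections import Counter

# the whole batch is available up front, unlike a stream
batch = ["spark is fast", "spark is in memory", "batches not records"]

# "map" phase: split every record in the batch at once
words = [word for line in batch for word in line.split()]

# "reduce" phase: aggregate counts across the whole batch
counts = Counter(words)

print(counts["spark"])  # → 2
```

Whereas the Storm sketch handled one tweet at a time, here every operation is applied to the entire batch in one go.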
Apache Spark vs Storm
Apache Spark vs Hadoop MapReduce
Continuous integration tools
TravisCI is a distributed continuous integration service that builds and tests projects hosted on GitHub. Basically, TravisCI sets up a hook on GitHub that watches for commits on branches you specify. You describe the build in a .travis.yml file, and whenever a commit occurs TravisCI follows the steps you give in that file.
Say you had a Jekyll project set up. You could essentially re-build the Jekyll site and upload it onto your webserver using TravisCI.
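That rebuild-and-upload idea could look roughly like this in a .travis.yml. This is only a sketch: the keys are real Travis CI configuration keys, but deploy.sh is a hypothetical script you would write to push the built _site/ directory to your server:

```yaml
language: ruby
rvm:
  - 2.6
install:
  - bundle install
script:
  - bundle exec jekyll build
after_success:
  - ./deploy.sh   # hypothetical: uploads _site/ to your webserver
```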
Very similar to TravisCI, but it also supports private repositories for free and gives you SSH access to the machine it tests/builds your code on. For configuration it uses a circle.yml file instead of a .travis.yml file. The setup of the file is a little different, but there are enough tutorials online.
Here are some sample files. If you're not sure, the $(whatever) syntax basically means an environment variable, which is something you can set up on CircleCI and TravisCI. Environment variables help you hide passwords by storing them on the CI service rather than exposing them to users online.
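As an illustration of using an environment variable from a circle.yml, here is a sketch - the section names match CircleCI's circle.yml format, but $DEPLOY_TOKEN and the deploy URL are made up, and you would set the token's value in CircleCI's web UI rather than in the file:

```yaml
machine:
  ruby:
    version: 2.6.0
deployment:
  production:
    branch: master
    commands:
      # the secret lives in CircleCI's settings, not in the repository
      - curl -H "Authorization: token $DEPLOY_TOKEN" https://example.com/deploy
```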
Jenkins is an open-source continuous integration server, meaning it does things similar to CircleCI and TravisCI, but you have to self-host it. Some large companies use it because it's customizable. Jenkins is used by Facebook, eBay, Etsy, GitHub, and many other companies.
TravisCI vs CircleCI
Mesos's slogan is "Program against your datacenter like it’s a single pool of resources". In this category you'll see technologies created to help manage a large number of servers; they're usually used by companies running 1,000+ servers.
Apache Mesos sets out to "[abstract] CPU, memory, storage, and other compute resources away from machines". For companies with large datacenters, it gives them the ability to abstract away the lower-level functionality of each particular server and access the higher-level functionality they actually need. A startup, Mesosphere, advertises its operating system, which is based on Mesos, as "the datacenter operating system".
Large companies such as Twitter, Netflix, AirBnB, eBay, and many others use Mesos for certain applications. From Wikipedia:
There is a good introduction at the linked reference.
Configuration management tools
Chef is a tool used to streamline the provisioning and configuration of new servers. When you think about scaling your product from a single server to 5, 10, or 100s of servers, you have to start setting up and configuring servers automatically rather than individually. Chef was created to solve this exact issue.
When using Chef you write recipes that describe what Chef should do and how it should manage each application you're setting up on each machine. These recipes are written in Ruby. There are pre-created recipes you can use for things like MySQL, Nginx, Apache Hadoop, Python, and much much more. If it's a popular tool, then a Chef recipe has been written for it. A recipe dictates how an application is installed and then run on the server. A set of recipes is called a cookbook!
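A recipe for installing and running Nginx could look something like this. It uses Chef's standard package/template/service resources, but it's only a sketch: nginx.conf.erb is a placeholder template file you would ship in your own cookbook:

```ruby
# install the nginx package from the platform's package manager
package 'nginx'

# render the site's config from a template shipped with the cookbook
template '/etc/nginx/nginx.conf' do
  source 'nginx.conf.erb'
  notifies :reload, 'service[nginx]'   # reload nginx when the config changes
end

# make sure nginx is running now and starts on boot
service 'nginx' do
  action [:enable, :start]
end
```

The same recipe can then be applied to 5 or 500 machines, which is the "automatically rather than individually" point made above.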
Chef is used by major companies such as AirBnB, Mozilla, Facebook, Rackspace, and Expedia.
Puppet vs Chef