Today we’re pleased to announce the release of the Open edX Analytics Devstack and want to take some time to explain how the project came about. Big thanks go to Philippe Chiu and Braden MacDonald for their heroic efforts in making this project come to fruition.
At last year’s very first Open edX Conference Hackathon, Philippe Chiu (from IONISx) suggested an awesome project: to run the entire edX analytics stack from a Docker container. The goal of this project was to develop something like the developer stack (Devstack) that most Open edX developers use when they develop patches for edx-platform. This “Analytics Devstack” would contain all of the external dependencies needed by the analytics systems, installed in one conveniently isolated container.
Diagram showing all services co-existing inside the analytics devstack container
The edX analytics team had been doing most of our development on Elastic MapReduce clusters (on AWS), which conveniently include all of the dependencies needed to run our code. However, this approach is cumbersome and prohibitively expensive for many open source contributors. Instead, we wanted to stand up an analytics stack inside a container that could ingest tracking logs (clickstream data), process them, and display the results on Insights (the analytics dashboard provided to instructors and course staff working on edX courses).
Philippe and I spent the next two days hacking on all of the pieces needed to achieve that goal. By the end, we had gotten pretty far: we were able to run the data pipeline (edx-analytics-pipeline), but still needed to wire up some of the services. After the hackathon, I spent a little time pushing it forward, mostly by converting the Dockerfile-based configuration to Ansible roles and playbooks. This change enabled us to run the Ansible configuration anywhere we chose (including during a Docker image build, an AMI build, or a Vagrant image provisioning step).
Fast forward a few months, and Braden MacDonald (from OpenCraft) was looking to make some significant contributions to the analytics services. He also saw a need for a devstack equivalent for the edX analytics services, and developed a completely functional Vagrant image that can run the entire stack. In so doing, he worked out many of the details that had remained outstanding after the initial effort.
The edX engineering team was so impressed by this massive contribution that we thought it made sense to merge it with the existing work and produce a final product that could:
- Be installed into a normal Open edX devstack, or be spun up in an entirely separate virtual machine.
- Run on the same machine as the LMS, allowing for significantly simplified network configuration, without needing to worry about port forwarding and other such complications.
- Be tightly integrated with the edx/configuration repo, allowing for future simplified deployment to sandboxes and other edX development environments.
- Use the same deployment logic that is used in production.
- Take advantage of other edX infrastructure that supports the deployment and management of these independently deployable applications (IDAs).
The net result is a set of Ansible roles and playbooks that have been merged into the edx/configuration repo. Now, with just a few commands, developers can stand up a complete analytics development environment inside a virtual machine. Within this environment, you can click around in the LMS, run the data pipeline, and then refresh a page in Insights to see the charts change based on your actions!
Want to try it out? Check out the documentation on the Analytics Devstack!
Want to make the analytics devstack even better? We are hoping to extend it in the following ways:
- We would like to be able to run the data pipeline acceptance tests in this environment. Currently, the tests have some hardcoded dependencies on S3. The edX engineering team plans to do this work in the near future.
- We have some analytics-related configuration in the edx/edx-analytics-configuration repo and some in the edx/configuration repo. We would like to reduce this complexity by moving the logic from one repo into the other, so that there is one place to go to find analytics-related operational logic.
- Make the edx-analytics-pipeline deployment procedure more idiomatic and consistent with other services, such as Insights and the Analytics Data API.
- Allow Insights to display today’s data instead of always displaying yesterday’s data. Some reports don’t currently show any changes until the next calendar day.
I cannot thank Braden and Philippe enough for putting this all together and making it possible for all of us to work on the analytics stack more easily and effectively. We look forward to seeing what changes the Open edX community has in mind for Open edX Insights, and what cool projects will develop out of the 2015 Open edX Hackathon! We hope to see you there.
Gabe Mulley is a Principal Software Engineer on the edX Analytics Team.