DataDirect Cloud is a connectivity platform as a service (cPaaS) running on Amazon Web Services (AWS) that provides data connectivity through open industry standards (OData, ODBC, JDBC). To illustrate how critical the service is: Intuit uses it to give 10,000 Salesforce users real-time connectivity to Oracle data behind its corporate firewall. If the service goes down, Intuit agents can no longer access order histories, case histories or commission calculations. Beyond this use case, thousands of DataDirect Cloud users connect various data sources for their core business operations.
DataDirect Command Center (DCCC) is a collection of Sumo Logic dashboards that measure key metrics focused on the DataDirect Cloud user experience. Each dashboard is displayed on its own monitor, and the monitors are mounted together to give a collective view of the following metrics:
- Top Users
- Top Errors
- Trends for Key Customers with personalized screens broken out by account types
- Production Usage Metrics
- Throughput
- Error Counts
- Data Source Volumes
- Interface Volumes
- Types of Queries Executed
- Failure Rate
- Integrated JVM metrics
Notes:
No customer-specific information is monitored, per Progress Software privacy policies.
To monitor JVM memory, we developed code that uses JMX to feed metrics from AWS into Sumo Logic, providing a 360-degree view of the systems.
Eric Brown, Cloud Engineering Manager, led the effort to create the DCCC; both are pictured above. The DCCC uses the built-in dashboarding functionality in Sumo Logic and was developed to improve the DataDirect Cloud user experience. The service enables new data connectivity functionality that pushes traditional workloads, which makes delivering a great user experience imperative. In response, the engineering organization actively monitors the DCCC to detect anomalies in usage patterns and take appropriate action, as described in the next section on visualization. Sumo Logic has delivered big data analytics throughout the organization, and we did not have to engage our data scientists for the project.
Below are tips for creating search queries in Sumo Logic for the DCCC:
- Start small: keep both the scope and the search window (timeframe) small.
- Stay focused: take one question you want answered and create a query to answer it, then move on to the next. This can be hard when there are many questions to answer, but the focus will help you learn Sumo Logic syntax and particularities much faster.
- Filter steps: it’s impossible to know about all the logs in a moderately complex system. Start searches broad, then filter down one step at a time. Sumo Logic uses a pipe (“|”) to separate filter steps. Keep in mind that each time a log passes into a pipe, it may not come out the other end. Think of these pipes as “gates”. Some keywords act like a password, letting logs “through the gate” where the filter condition would otherwise have blocked them (for example, the nodrop option of the parse operator).
- Multiple filters: adding filters one at a time and checking the results was the most efficient way to develop larger, more complex queries. It’s much easier to troubleshoot a single filter statement than four back-to-back filters.
- Confirm, confirm, confirm: make sure your queries are “correct” in every sense. Nothing is worse than making judgments based on queries that return unintended results.
- Comments: use “//” to comment out a line (this saves a lot of time when troubleshooting queries). It also gives you space for comments. We have many generic queries about user logs where you only have to add the user ID, and we use “//” lines to give instructions to whoever runs the query.
- Ask for help: the documentation is great and support is outstanding. Kevin at Sumo Logic answers questions quickly and accurately (even over the weekend).
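The “gate” idea and the one-filter-at-a-time workflow from the tips above can be modeled locally in a few lines of Python. This is a minimal sketch, not how Sumo Logic evaluates queries: each hypothetical gate is a plain predicate on a log line, the sample lines are invented, and counts_per_step reports how many lines survive after each new gate is added.

```python
# Hypothetical model of a pipe-delimited search: each stage is a "gate"
# (a predicate), and a log line reaches the next stage only if it passes.
from typing import Callable, List

Gate = Callable[[str], bool]


def run_pipeline(lines: List[str], gates: List[Gate]) -> List[str]:
    """Apply the gates in order; lines that fail any gate drop out."""
    survivors = list(lines)
    for gate in gates:
        survivors = [line for line in survivors if gate(line)]
    return survivors


def counts_per_step(lines: List[str], gates: List[Gate]) -> List[int]:
    """Survivor count after adding each gate one at a time, so an
    unexpected drop points at the filter that caused it."""
    return [len(run_pipeline(lines, gates[:n])) for n in range(1, len(gates) + 1)]


# Sample lines and gates, invented for illustration only.
SAMPLE_LOGS = [
    "INFO [alice] [login] [success=true][ms=341]",
    "ERROR [alice] [query] [success=false][ms=5012]",
    "INFO [bob] [login] [success=true][ms=120]",
    "ERROR [bob] [login] [success=false][ms=98]",
]
GATES: List[Gate] = [
    lambda line: "[alice]" in line,         # broad: one user
    lambda line: line.startswith("ERROR"),  # narrower: errors only
    lambda line: "[query]" in line,         # narrowest: one operation
]

print(counts_per_step(SAMPLE_LOGS, GATES))  # [2, 1, 1]
```

Reading the counts left to right is the local equivalent of re-running the query after each new pipe stage: if one step drops far more lines than expected, that step is the one to inspect first.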
Monitoring Usage example for a QA user (using Comments)
When to use alerts versus real-time data visualizations
DevOps has several tools for monitoring applications and systems across a mix of Sumo Logic and open source tools such as Zabbix and Cacti. The R&D team, on the other hand, is interested in very specific information captured in the DCCC, and both teams work together to exchange queries and intelligence. When it comes to intelligence in our systems, humans are fantastic at detecting patterns in visualizations when the question is unknown. Alerts, by contrast, are great when you know both the question and the answer. In most cases, it’s not constructive to look at large amounts of raw data.
When it comes to data visualization, dashboards are more than just dashboards
Dashboards can act as a starting point for deeper investigations. This helps “ramp up” engineers who are new to Sumo Logic by giving them a starting point during troubleshooting. The R&D team started looking at visualization because they may not know in advance which patterns in the DCCC dashboards to turn into alerts. For example, the R&D team detected an anomaly in “user experience” through visual insight from a dashboard and proactively alerted the customer before the customer contacted our support team. This is a great example of effective monitoring through data visualization, and of customer service in the cloud era.
Alerts are also very useful: some information is not mission critical but is still very important for growing our products. We’ve created queries and attached them to automated alerts in Sumo Logic to monitor the “user experience” of our evaluation users. Every morning we get an automated email with exception reports, from which we decide whether to reach out proactively to specific users with help.
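In Sumo Logic the morning report is produced by a scheduled search that emails its results; the Python sketch below only illustrates the kind of grouping such a report performs. Everything in it is hypothetical: the LogEvent shape, the exception_report and render helpers, and the evaluation user names are assumptions for illustration, not the queries described above.

```python
# Hypothetical aggregation behind a morning exception report: group ERROR
# events by evaluation user so someone can decide whom to contact.
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class LogEvent(NamedTuple):
    username: str
    level: str
    message: str


def exception_report(events: List[LogEvent], eval_users: Set[str]) -> Dict[str, List[str]]:
    """Map each evaluation user with at least one ERROR to their messages."""
    report: Dict[str, List[str]] = defaultdict(list)
    for event in events:
        if event.username in eval_users and event.level == "ERROR":
            report[event.username].append(event.message)
    return dict(report)


def render(report: Dict[str, List[str]]) -> str:
    """Plain-text email body, most affected users first."""
    lines: List[str] = []
    for user, messages in sorted(report.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        lines.append(f"{user}: {len(messages)} error(s)")
        lines.extend(f"  - {msg}" for msg in messages)
    return "\n".join(lines)


# Invented sample data; in this sketch the real input is assumed to come
# from the results of a scheduled search.
EVENTS = [
    LogEvent("trial1", "ERROR", "Login failed: invalid credentials"),
    LogEvent("trial1", "INFO", "Query completed"),
    LogEvent("trial2", "ERROR", "Backend timeout"),
    LogEvent("trial2", "ERROR", "Backend timeout"),
    LogEvent("customer9", "ERROR", "Driver not found"),
]
REPORT = exception_report(EVENTS, eval_users={"trial1", "trial2", "trial3"})
print(render(REPORT))
```

Sorting by error count keeps the users most likely to need help at the top of the email, and customer9 is left out because it is not in the evaluation set.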
Once a visualization uncovers value for customers, it is integrated over time into the larger alert system run by DevOps.
How DCCC forces R&D to use best practices for logging messages
In building common queries shared between R&D and DevOps, we developed the following logging best practices:
Naming your collectors
Do yourself a favor and settle on a naming convention for your collectors up front. We have several deployments in the cloud; each deployment consists of many virtual machines, and each virtual machine can have one or more programs generating logs. Sometimes you want to search all the logs within a particular instance to piece together a “chain of events”. If all the collectors for a particular instance share the same prefix, it’s easy to start searching your logs (and the names are easier to remember). When troubleshooting a workflow, we sometimes look for a username in all the logs, which is easy to do using wildcards in the Sumo Logic search. We use the format:
[Product]-[Instance]-[Function]-[IPAddress]
So we might have the following:
[coolproduct]-[live]-[nat]-[127.0.0.1]
[coolproduct]-[live]-[website]-[127.0.0.2]
[coolproduct]-[live]-[db]-[127.0.0.3]
[coolproduct]-[live]-[not]-[127.0.0.4]
With this structure it’s easy for me to quickly search all the logs in the production instance of a product using:
_sourceCategory=coolproduct*live* and "username"
_sourceCategory=coolproduct*test* and "username2"
And, of course, a focused query is just as easy to follow and remember:
_sourceCategory=coolproduct*live*website* and "username"
Or search across all instances (such as testA, testB and live) like this:
_sourceCategory=coolproduct* and "username"
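To check locally which names a wildcard would pick up, the naming convention can be mirrored in Python. This is a sketch under stated assumptions: the bracketed parts of the format are read as placeholders (so a real name looks like coolproduct-live-db-127.0.0.3), the testA collector and its address are hypothetical, and matching is assumed to be anchored at both ends and case-insensitive. The queries above run against _sourceCategory inside Sumo Logic, which this code only approximates.

```python
# Hypothetical helper for the [Product]-[Instance]-[Function]-[IPAddress]
# naming convention, plus a local check of which names a wildcard selects.
from fnmatch import fnmatchcase
from typing import List


def collector_name(product: str, instance: str, function: str, ip: str) -> str:
    """Join the four parts, e.g. 'coolproduct-live-db-127.0.0.3'."""
    return "-".join([product, instance, function, ip])


def matching(names: List[str], pattern: str) -> List[str]:
    """Names selected by a '*' wildcard pattern. Assumes matching is anchored
    at both ends and case-insensitive; fnmatch also treats '?' and '[...]'
    in the pattern specially, so the patterns here avoid those characters."""
    return [name for name in names if fnmatchcase(name.lower(), pattern.lower())]


NAMES = [
    collector_name("coolproduct", "live", "nat", "127.0.0.1"),
    collector_name("coolproduct", "live", "website", "127.0.0.2"),
    collector_name("coolproduct", "live", "db", "127.0.0.3"),
    collector_name("coolproduct", "testA", "website", "127.0.0.5"),  # hypothetical
]

print(matching(NAMES, "coolproduct*live*"))          # the three live collectors
print(matching(NAMES, "coolproduct*live*website*"))  # ['coolproduct-live-website-127.0.0.2']
print(matching(NAMES, "coolproduct*live*website"))   # [] because the IP follows 'website'
print(matching(NAMES, "coolproduct*"))               # all four
```

The third call shows why the wildcard has to cover the trailing parts of the name: under these assumptions, a pattern that stops at the function name selects nothing, because the IP address comes after it.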
Common Log Structure
The structure of your logs is also important. Keeping a similar pattern across log contents will help with your parsing logic. For example, if you settle on some common fields across most logs (as we did), you can start your log entries like this:
[dateStamp][loglevel][username][x][y][z]
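One way to keep the shared prefix consistent is to have every component build it through a single helper, so the reading side can rely on one rule. The Python sketch below is hypothetical: log_prefix and split_prefix are illustrative names, the date format is borrowed from the example log further down, and it assumes no field ever contains a closing bracket.

```python
# Hypothetical helper that writes the shared prefix fields in a fixed order,
# and the matching splitter that reads them back.
import re
from datetime import datetime
from typing import List, Sequence

FIELD = re.compile(r"\[([^\]]*)\]")


def log_prefix(when: datetime, level: str, username: str, extra: Sequence[str] = ()) -> str:
    """Build '[dateStamp][loglevel][username][x][y][z]'-style text.
    Assumes no field contains ']'."""
    stamp = when.strftime("%d-%b-%Y %H:%M:%S")
    return "".join(f"[{field}]" for field in [stamp, level.upper(), username, *extra])


def split_prefix(line: str) -> List[str]:
    """Return the bracketed fields in order."""
    return FIELD.findall(line)


PREFIX = log_prefix(datetime(2015, 8, 25, 17, 8, 18), "info", "username", ["login"])
print(PREFIX)  # [25-Aug-2015 17:08:18][INFO][username][login]
```

Because position carries meaning, split_prefix(line)[2] is always the username, which is what makes a shared prefix cheap to parse.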
Use key/value pairs where possible; this makes parsing easier. For example:
- success=true
- ms=341 (response time)
- version=2.3
Example log:
25-Aug-2015 17:08:18.264 INFO [http-nio-8080-exec-1] [username][FwO3Wvy5frS6O9wART3Y].[login] [success=true][ms=1242][bytesIn=91][bytesOut=1241][clientVersion=2.0.1.64][timezone=America/New_York][flags=1][portNumber=xyz][connectionRetryCount=0]
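To show why the [key=value] convention pays off, here is a minimal Python sketch that extracts every pair from a line shaped like the example log, skipping bracketed tokens without an equals sign (thread name, username, session token, operation). The regular expression and the parse_kv name are illustrative assumptions, not part of the product.

```python
# Minimal sketch: pull [key=value] pairs out of a log line shaped like the
# example above. Bracketed tokens without '=' (thread, username, ...) are skipped.
import re
from typing import Dict

KV_PATTERN = re.compile(r"\[([A-Za-z0-9_]+)=([^\]]*)\]")


def parse_kv(line: str) -> Dict[str, str]:
    """Return every [key=value] pair in the line; values stay strings."""
    return dict(KV_PATTERN.findall(line))


SAMPLE = (
    "25-Aug-2015 17:08:18.264 INFO [http-nio-8080-exec-1] "
    "[username][FwO3Wvy5frS6O9wART3Y].[login] "
    "[success=true][ms=1242][bytesIn=91][bytesOut=1241]"
    "[clientVersion=2.0.1.64][timezone=America/New_York][flags=1]"
    "[portNumber=xyz][connectionRetryCount=0]"
)

fields = parse_kv(SAMPLE)
response_ms = int(fields["ms"])  # convert only the fields you need
print(fields["success"], response_ms, fields["clientVersion"])  # true 1242 2.0.1.64
```

Inside Sumo Logic itself, the same bracket anchors work with the parse operator, for example | parse "[ms=*]" as ms, which extracts the response time as a field. A missing closing bracket would make one value swallow the next pair, which is one more reason to keep the log structure strict.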
Our product integrates with users’ backend systems, and we typically include error messages from those backend systems in our logs. This allows us to alert users to issues on their end that they may not know about.
What’s next for the DCCC and the next phase of analytics on Sumo Logic data?
The Sumo Logic dashboards are fantastic, and there are plans in the DataDirect Cloud R&D offices in Research Triangle Park, NC, to expand the Command Center into common areas to crowdsource pattern detection in the dashboards. It’s also a unique opportunity to engage more employees directly with our technology, and it serves as a constant reminder of all the impressive work that goes into DevOps and cloud R&D to successfully run business-critical cloud applications.
DevOps is planning to expand the concept of a Command Center to Progress Software corporate headquarters in Bedford, MA for even greater visualization across the complete portfolio of cloud applications.
About the Author
Sumit Sarkar is the Chief Data Evangelist for Progress Software. You can follow him on LinkedIn (www.linkedin.com/in/meetsumit) and Twitter (@SAsInSumit).