The tool for beautiful monitoring and metric analytics & dashboards for Graphite, InfluxDB & Prometheus & More
The open-source platform for monitoring and observability.
Grafana allows you to query, visualize, alert on and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data driven culture:
Visualize: Fast and flexible client side graphs with a multitude of options. Panel plugins for many different ways to visualize metrics and logs.
Dynamic Dashboards: Create dynamic & reusable dashboards with template variables that appear as dropdowns at the top of the dashboard.
Explore Metrics: Explore your data through ad-hoc queries and dynamic drilldown. Split view and compare different time ranges, queries and data sources side by side.
Explore Logs: Experience the magic of switching from metrics to logs with preserved label filters. Quickly search through all your logs or stream them live.
Alerting: Visually define alert rules for your most important metrics. Grafana will continuously evaluate and send notifications to systems like Slack, PagerDuty, VictorOps, OpsGenie.
Mixed Data Sources: Mix different data sources in the same graph! You can specify a data source on a per-query basis. This works even for custom data sources.
Hi everyone,
I recently joined raintank and I will be working with @torkelo, @mattttt , and you, on alerting support for Grafana.
From the results of the Grafana User Survey it is obvious that alerting is the most commonly missed feature for Grafana.
I have worked on/with a few alerting systems in the past (nagios, bosun, graph-explorer, etsy's kale stack, ...) and I'm excited about the opportunity in front of us:
we can take the best of said systems, but combine them with Grafana's focus on a polished user experience, resulting in a powerful alerting system, well-integrated and smooth to work with.
First of all, terminology sync:
alerting: executing logic (threshold checks or more advanced) to know the state of an entity. (ok, warning, critical)
notifications: emails, text messages, posts to chat, etc to make people aware of a state change
monitoring: this term covers everything about monitoring (data collection, visualizations, alerting) so I won't be using it here.
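To make the split between alerting and notifications concrete, here is a minimal TypeScript sketch. The names `AlertState`, `Thresholds`, `evaluate` and `shouldNotify` are made up for illustration and are not part of Grafana; the sketch also assumes that higher values are worse.

```typescript
// The three states named in the terminology list above.
type AlertState = 'ok' | 'warning' | 'critical';

// Hypothetical threshold pair; higher values are assumed to be worse.
interface Thresholds {
  warning: number;
  critical: number;
}

// Alerting: execute logic (here a plain threshold check) to learn the state.
function evaluate(value: number, t: Thresholds): AlertState {
  if (value >= t.critical) return 'critical';
  if (value >= t.warning) return 'warning';
  return 'ok';
}

// Notifications: only a state change is worth telling people about.
function shouldNotify(previous: AlertState, current: AlertState): boolean {
  return previous !== current;
}
```

Keeping the two functions separate mirrors the terminology: the state is computed first, and whether anyone is told about it is a second decision.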
I want to spec out requirements, possible implementation ideas and their pros/cons. With your feedback, we can adjust, refine and choose a specific direction.
General thoughts:
integration with existing tools vs built-in: there are some powerful alerting systems out there (bosun, kale) that deserve integration.
Many alerting systems are more basic (define expression/threshold, get notification when breached), for those it seems integration is not worth the pain (though I won't stop you)
The integrations are a long term effort. I think the low hanging fruit ("meet 80% of the needs with 20% of the effort") can be met with a system
that is more closely tied to Grafana, i.e. compiled into the grafana binary.
That said, a lot of people confuse separation of concerns with "must be different services".
If the code is sane, it'll be decoupled packages but there's nothing necessarily wrong with compiling them together. i.e. you could run:
1 grafana binary that does everything (grafana as you know it + all alerting features) for simplicity
multiple grafana binaries in different modes (visualization instances and alerting instances) even highly available/redundant setups if you want to, using an external worker queue
That said, we don't want to reinvent the wheel: we want alerting code and functionality to integrate well with Grafana, but if high-quality code is compatible, we should use it. In fact, I have a prototype that leverages some existing bosun code. (see "Current state")
polling vs stream processing: they have different performance characteristics,
but they should be able to take the same or similar alerting rule definitions (thresholds, boolean logic, ..), they mostly are about how the actual rules are executed and don't
change much about how rules are defined. Since polling is much simpler and should be able to scale fairly far this should IMHO be our initial focus.
Current state
The raintank/grafana version currently has an alerting package
with a simple scheduler, an in-process worker bus as well as rabbitmq based, an alert executor and email notifications.
It uses the bosun expression libraries which gives us the ability to evaluate arbitrarily complex expressions (use several metrics, use boolean logic, math, etc).
This package is currently raintank-specific but we will merge a generic version of this into upstream grafana. This will provide an alert execution platform but notably still missing is
an interface to create and manage alerting rules
state management (acknowledgements etc)
these are harder problems, which I hope to tackle with your input.
Requirements, Future implementations
First off, I think bosun is a pretty fantastic system for alerting (not so much for visualization)
You can make your alerting rules as advanced as you want, and it enables you to fine-tune over time, backtest on historical data, so you can get them just right.
And it has a good state machine.
In theory we could just compile bosun straight into grafana, and leverage bosun via its REST api instead of Golang api, but then we have less fine-grained control and
for now I feel more comfortable trying out piece by piece (piece meaning golang package) and make the integration decision on a case by case basis. Though the integration
may look different down the road based on experience and as we figure out what we want our alerting to look like.
Either way, we don't just want great alerting. We want great alerting combined with great visualizations, notifications with context, and a smooth workflow where you can manage
your alerts in the same place you manage your visualizations. So it needs to be nicely integrated into Grafana. To that end, there's a few things to consider:
some visualized metrics (metrics plotted on graphs) are not alerted on
some visualized metrics are alerted on:
A: with simple threshold checks: easy to visualize alerting logic
B: with more advanced logic: (e.g. look at standard deviation of the series being plotted, compare current median against historical median, etc): can't easily be visualized next to the input series
some metrics used in alerting logic are not to be visualized
Basically, there's a bunch of stuff you may want visualized (V), and a bunch of stuff you want alerts (A), and V and A have some overlap.
I need to think about this a bit more and wonder what y'all think.
There will definitely need to be 1 central place where you can get an overview of all the things you're alerting on, irrespective of where those rules are defined.
There are a few more complications, which I'll explain through an example sketch of what alerting could look like:
let's say we have a timeseries for requests (A) and one for erroneous requests (B) and this is what we want to plot.
we then use fields C,D,E to put stuff that we don't want to alert on.
C contains the formula for ratio of error requests against the total.
we may for example want to alert (see E) if the median of this ratio in the last 5min is more than 1.5 times what the ratio was in the same 5-minute period last week, and also
if the errors seen in the last 5min are worse than the errors seen since 2 months ago until 5min ago.
notes:
some queries use different timeranges than what is rendered
in addition to processing by tsdb (such as Graphite's sum(), divide() etc which return series) we need to be able to reduce series to single numbers. fairly easy to implement (and in fact currently the bosun library does this for us)
we need boolean logic (bosun also gives us this)
in this example the expression only uses variables defined within the same panel, but it might make sense to include expressions of other panels/graphs.
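The first condition in the example (median error ratio now versus 1.5 times the same window last week) can be sketched in TypeScript as below. This is only an illustration of "reduce series to single numbers, then apply boolean logic"; it is not the bosun expression library, and the helper names `errorRatio`, `median` and `ratioRegressed` are hypothetical.

```typescript
// C in the sketch: per-point ratio of error requests against the total.
// Points with zero requests are treated as ratio 0 (an assumption).
function errorRatio(requests: number[], errors: number[]): number[] {
  return requests.map((r, i) => (r > 0 ? (errors[i] ?? 0) / r : 0));
}

// Reduce a series to a single number: the median of its points.
function median(points: number[]): number {
  if (points.length === 0) return NaN;
  const sorted = [...points].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] ?? NaN;
  const lower = sorted[mid - 1] ?? NaN;
  return sorted.length % 2 === 1 ? upper : (lower + upper) / 2;
}

// E in the sketch (first half only): alert when the current median ratio is
// more than `factor` times the median ratio of the comparison window.
function ratioRegressed(now: number[], lastWeek: number[], factor = 1.5): boolean {
  return median(now) > factor * median(lastWeek);
}
```

Note that the two inputs cover different time ranges than the graph itself renders, which is the first point in the notes above. The second condition (last 5min versus the last two months) would be another reduced comparison combined with this one through boolean logic.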
other ponderings:
do we integrate with current grafana graph threshold settings (which are currently for viz only, not for processing) ? if the expression is a threshold check, we could automatically
display a threshold line
using the letters is a bit clunky, could we refer to the aliases instead? like #requests and #errors?
if the expressions are stats.$site.requests and stats.$site.errors, what if we want to have separate alert instances for every site (but only set up the rule once)? what if we only want it for a select few of the sites? what if we want different parameters based on which site? bosun actually supports all these features, and we could expose them, though we should probably build a UI around them.
I think for an initial implementation every graph could have two fields, like so:
where the expression is something like what I put in E in the sketch.
for logic/data that we don't want to visualize, we just toggle off the visibility icon.
grafana would replace the variables in the formulas, execute the expression (with the current bosun based executor). results (state changes) could be fed into something like elasticsearch and displayed via the annotations system.
Thoughts?
Do you have concerns or needs that I didn't address?
As per http://docs.grafana.org/alerting/rules/, Grafana plans to track state per series in future releases.
"If a query returns multiple series then the aggregation function and threshold check will be evaluated for each series. What Grafana does not do currently is track alert rule state per series." and
"To improve support for queries that return multiple series we plan to track state per series in a future release"
But it seems like there can be use cases where we have graphs containing a set of metrics for which different sets of alerts are required. This is slightly different from "Support per series state change" ( https://github.com/grafana/grafana/issues/6041 ) because
The action (notifications) can be different.
Also, tracking separate states of an alert is not always preferred (as the end-user would need to know the details behind the individual states ) vs just knowing if alert is triggered.
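The behaviour quoted from the docs (aggregation and threshold check per series, but one state per rule) can be pictured with the TypeScript sketch below. It is an illustration only, not Grafana's evaluator: the `Series` shape, the `avg` reducer and the "any breaching series puts the rule in alerting" rule are assumptions made for the example.

```typescript
interface Series {
  name: string;
  points: number[];
}

// Aggregation function: average of the points (one possible reducer).
function avg(points: number[]): number {
  if (points.length === 0) return NaN;
  return points.reduce((sum, p) => sum + p, 0) / points.length;
}

// Run the reducer and threshold check for each series, then collapse the
// results into one rule-level state (assumption: any breach means alerting).
function evaluateRule(
  series: Series[],
  threshold: number
): { alerting: boolean; breaching: string[] } {
  const breaching = series
    .filter((s) => avg(s.points) > threshold)
    .map((s) => s.name);
  return { alerting: breaching.length > 0, breaching };
}
```

Because only `alerting` is kept between evaluations in this model, a rule that stays in alerting while the set of breaching series changes produces no state change, and every series shares the same notification action.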
Hello dear developers team.
Thank you for an awesome product, but I have one problem.
There is currently a hardcoded US date format in the time_format function.
The most annoying case is displaying month and day. When I see something like "2/3" I'm a bit confused. Is it "the second of March" or "the third of February"?
The saddest thing is that I can't configure this behaviour.
Unfortunately the simplest way (and maybe the most proper one) doesn't help here. I mean toLocaleString with additional options: you can return a different array of options instead of a hardcoded format pattern, and this method converts the date in accordance with the right locale.
But in our case there is a jQuery plot, and it requires a date format because it converts the timestamp by itself.
So the second way is to make some kind of mapping locale -> format array. Example. It seems a bit ugly, but it could be the only working solution.
Maybe I missed some obvious and better solutions. That's why I didn't create a pull request. =)
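A minimal TypeScript sketch of the "mapping locale -> format" idea follows. It assumes the plotting library accepts strftime-style patterns with `%m` and `%d`; the table entries, the fallback and the helper names are illustrative, not an existing Grafana API.

```typescript
// Hypothetical locale -> month/day pattern table; entries are examples only.
const MONTH_DAY_FORMATS: Record<string, string> = {
  'en-US': '%m/%d',
  'en-GB': '%d/%m',
  'de-DE': '%d.%m.',
};

// Fall back to the US-style pattern for locales missing from the table.
const DEFAULT_MONTH_DAY_FORMAT = '%m/%d';

function monthDayFormat(locale: string): string {
  return MONTH_DAY_FORMATS[locale] ?? DEFAULT_MONTH_DAY_FORMAT;
}

// Tiny formatter for the two specifiers used above, to show the effect.
function formatMonthDay(date: Date, locale: string): string {
  const pad = (n: number): string => (n < 10 ? '0' : '') + n;
  return monthDayFormat(locale)
    .replace('%m', pad(date.getUTCMonth() + 1))
    .replace('%d', pad(date.getUTCDate()));
}
```

With such a table the pattern handed to the plot would depend on the user's locale, so the third of February would render as `02/03` for `en-US` and `03/02` for `en-GB`.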
The timeseries panel still misses time region support... we need to find some way to support that. Rather than have the strict regions defined in panel JSON, it would be great if the regions were defined by a query (ie, "open/closed") in addition to fixed calendar based calculations.
This PR just extracts the timeRegion functions to a utility folder and adds a few types
Support for compact Explore URLs is deprecated and will be removed in a future release. Until then, when navigating to Explore using the deprecated format the URLs are automatically converted. If you have existing links pointing to Explore update them using the format generated by Explore upon navigation.
You can identify a compact URL by its format. Compact URLs have the left (and optionally right) url parameter as an array of strings, for example &left=["now-1h","now"...]. The standard explore URLs follow a key/value pattern, for example &left={"datasource":"test"...}. Please be sure to check your dashboards for any hardcoded links to Explore and update them to the standard URL pattern. Issue #50873
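When auditing hardcoded links, the distinction above can be checked mechanically. The TypeScript sketch below classifies the decoded value of a `left` (or `right`) parameter; the function name and return labels are made up for illustration, and the input is assumed to be already URL-decoded.

```typescript
type ExploreParamFormat = 'compact' | 'standard' | 'unknown';

// Compact URLs carry a JSON array, standard URLs carry a JSON object.
function classifyExploreParam(raw: string): ExploreParamFormat {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return 'unknown'; // not JSON at all
  }
  if (Array.isArray(parsed)) return 'compact';
  if (typeof parsed === 'object' && parsed !== null) return 'standard';
  return 'unknown';
}
```

Links classified as `compact` are the ones to regenerate by opening them in Explore and copying the resulting URL.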
Login: Fix mismatching label on auth_module in user list. #49177, @Jguer
Playlists: Save button now correctly creates a new playlist. #50381, @ashharrison90
RBAC: Fix migrations running in the wrong order causing inheritance problem in enterprise. #50452, @gamab
RBAC: Fix migrations running into the wrong order. (Enterprise)
ServiceAccounts: Add identifiable token prefix to service account tokens. #49011, @Jguer
Traces: Fix missing CopyButton on KeyValueTables and overlapping of panels. #49271, @svennergr
Breaking changes
The @grafana/ui package helper function selectOptionInTest used in frontend tests has been removed as it caused testing libraries to be bundled in the production code of Grafana. If you were using this helper function in your tests please update your code accordingly:
// before
import { selectOptionInTest } from '@grafana/ui';
// ...test usage
await selectOptionInTest(selectEl, 'Option 2');
// after
import { select } from 'react-select-event';
// ...test usage
await select(selectEl, 'Option 2', { container: document.body });
Removed deprecated checkHealth prop from the @grafana/e2e addDataSource config. Previously this value defaulted to false, and has not been used in end-to-end tests since Grafana 8.0.3. Issue #50296
Removes the deprecated LegacyBaseMap, LegacyValueMapping, LegacyValueMap, and LegacyRangeMap types, and getMappedValue function from grafana-data. Migration is as follows:
| Old | New |
| ------------- | ------------- |
| LegacyBaseMap | MappingType |
| LegacyValueMapping | ValueMapping |
| LegacyValueMap | ValueMap |
| LegacyRangeMap | RangeMap |
| getMappedValue | getValueMappingResult |

Issue #50035
This change fixes a bug in Grafana where intermittent failure of database, network between Grafana and the database, or error in querying the database would cause all alert rules to be unscheduled in Grafana. Following this change scheduled alert rules are not updated unless the query is successful.
The get_alert_rules_duration_seconds metric has been renamed to schedule_query_alert_rules_duration_seconds. Issue #49874
Any secret (data source credentials, Alertmanager credentials, etc.) created or modified with Grafana v9.0 won't be decryptable by any previous version (by default), because the way encrypted secrets are stored in the database has changed. Secrets created or modified with previous versions will still be decryptable by Grafana v9.0.
If required, although generally discouraged, the disableEnvelopeEncryption feature toggle can be enabled to keep envelope encryption disabled once updating to Grafana v9.0.
If you need to roll back to an earlier version of Grafana (i.e. Grafana v8.x) for any reason after a secret has been created or modified with Grafana v9.0, the envelopeEncryption feature toggle will need to be enabled to keep backwards compatibility (only from v8.3.x, where it is a bit unstable; stable from 8.5.x).
As a final attempt to deal with issues related with the aforementioned situations, the grafana-cli admin secrets-migration rollback command has been designed to move back all the Grafana secrets encrypted with envelope encryption to legacy encryption. So, after running that command it should be safe to disable envelope encryption and/or roll back to a previous version of Grafana.
Alternatively or complementarily to all the points above, backing up the Grafana database before updating could be a good idea to prevent disasters (although the risk of getting some secrets corrupted only applies to those updated/created after updating to Grafana v9.0). Issue #49301
According to the dynamic labels documentation, you can use up to five dynamic values per label. There's currently no such restriction in the alias pattern system, so if more than 5 patterns are being used the GetMetricData API will return an error.
Dynamic labels only allow ${LABEL} to be used once per query. There's no such restriction in the alias pattern system, so in case more than 1 is being used the GetMetricData API will return an error.
When no alias is provided by the user, Grafana will no longer fallback with custom rules for naming the legend.
In case a search expression is being used and no data is returned, Grafana will no longer expand dimension values, for instance when using a multi-valued template variable or star wildcard * in the dimension value field. Ref https://github.com/grafana/grafana/issues/20729
Time series might be displayed in a different order. Using for example the dynamic label ${PROP('MetricName')}, might have the consequence that the time series are returned in a different order compared to when the alias pattern {{metric}} is used
In Elasticsearch, browser access mode was deprecated in grafana 7.4.0 and removed in 9.0.0. If you used this mode, please switch to server access mode on the datasource configuration page. Issue #49014
Environment variables passed from Grafana to external Azure plugins have been renamed:
AZURE_CLOUD renamed to GFAZPL_AZURE_CLOUD
AZURE_MANAGED_IDENTITY_ENABLED renamed to GFAZPL_MANAGED_IDENTITY_ENABLED
AZURE_MANAGED_IDENTITY_CLIENT_ID renamed to GFAZPL_MANAGED_IDENTITY_CLIENT_ID
There are no known plugins which were relying on these variables. Moving forward plugins should read Azure settings only via Grafana Azure SDK which properly handles old and new environment variables. Issue #48954
Removes support for ElasticSearch versions after their end-of-life, currently versions < 7.10.0. To continue to use ElasticSearch data source, upgrade ElasticSearch to version 7.10.0+.
Issue #48715
Application Insights and Insight Analytics queries in Azure Monitor were deprecated in Grafana 8.0 and finally removed in 9.0. Deprecated queries will no longer be executed. Please refer to the documentation for more information about this change.
grafana/ui: Button now specifies a default type="button"
The Button component provided by @grafana/ui now specifies a default type="button" when no type is provided. In previous versions, if the attribute was not specified for buttons associated with a <form> the default value was submit per the specification.
You can preserve the old behavior by explicitly setting the type attribute: <Button type="submit" />
The Rename by regex transformation has been improved to allow global patterns of the form /<stringToReplace>/g. Depending on the regex match used, this may cause some transformations to behave slightly differently. You can guarantee the same behaviour as before by wrapping the match string in forward slashes (/), e.g. (.*) would become /(.*)/ Issue #48179
<Select /> menus will now portal to the document body by default. This is to give more consistent behaviour when positioning and overlaying. If you were setting menuShouldPortal={true} before you can safely remove that prop and behaviour will be the same. If you weren't explicitly setting that prop, there should be no visible changes in behaviour but your tests may need updating. Please see the original PR (https://github.com/grafana/grafana/pull/36398) for migration guides. If you were setting menuShouldPortal={false} this will continue to prevent the menu from portalling.
Grafana alerting endpoint prefixed with api/v1/rule/test that tests a rule against a Cortex/Loki data source now expects the data source UID as a path parameter instead of the data source numeric identifier. Issue #48070
Grafana alerting endpoints prefixed with api/prometheus/ that proxy requests to a Cortex/Loki data source now expect the data source UID as a path parameter instead of the data source numeric identifier. Issue #48052
Grafana alerting endpoints prefixed with api/ruler/ that proxy requests to a Cortex/Loki data source now expect the data source UID as a path parameter instead of the data source numeric identifier. Issue #48046
Grafana alerting endpoints prefixed with api/alertmanager/ that proxy requests to an Alertmanager now expect the data source UID as a path parameter instead of the data source numeric identifier. Issue #47978
The format of log messages has been updated: lvl is now level, and eror and dbug have been replaced with error and debug. The precision of timestamps has been increased. To smooth the transition, it is possible to opt out of the new log format by enabling the feature toggle oldlog. This option will be removed in a future minor release. Issue #47584
In the Loki data source, the dataframe format used to represent Loki logs-data has been changed to a more efficient format. The query-result is represented by a single dataframe with a "labels" column, instead of the separate dataframes for every labels-value. Displaying such data in Explore, or in a logs panel in a dashboard, will continue to work without changes, but if the data was loaded into a different dashboard-panel, or Transforms were used, adjustments may be necessary. For example, if you used the "labels to fields" transformation with the logs data, please switch to the "extract fields" transformation. Issue #47153
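Log parsers that key on the old field names may need to accept both formats during the transition. The TypeScript sketch below rewrites an old-style line to the new naming; it assumes key=value lines where the level appears as `lvl=<name>`, and the helper name is hypothetical.

```typescript
// Old abbreviated level names and their replacements, per the note above.
const LEVEL_NAMES: Record<string, string> = {
  eror: 'error',
  dbug: 'debug',
};

// Rewrite `lvl=<name>` to `level=<name>`, expanding the abbreviated levels.
// Lines without an `lvl=` field are returned unchanged.
function normalizeLogLine(line: string): string {
  return line.replace(/\blvl=(\w+)/, (_match: string, lvl: string) => {
    return `level=${LEVEL_NAMES[lvl] ?? lvl}`;
  });
}
```

Only the field rename is handled here; the increased timestamp precision is left untouched.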
Deprecations
setExploreQueryField, setExploreMetricsQueryField and setExploreLogsQueryField are now deprecated and will be removed in a future release. If you need to set a different query editor for Explore, conditionally render based on props.app in your regular query editor. Please refer to https://grafana.com/docs/grafana/latest/developers/plugins/add-support-for-explore-queries/ for more information.
Issue #48701
Plugin development fixes & changes
Chore: Remove react-testing-lib from bundles. #50442, @jackw
Loki: Fix uncaught errors if labelKey contains special characters. #49887, @svennergr
Prometheus: Fix aligning of labels of exemplars after backend migration. #49924, @aocenas
SharePDF: Fix repeated datasource variables in PDF. (Enterprise)
State Timeline: Fix Null Value Filling and Value Transformation. #50054, @codeincarnate
Usage stats: Divide collection into multiple functions to isolate failures. #49928, @sakjur
Breaking changes
Removes support for storing/using datasource password and basicAuthPassword unencrypted which was deprecated in Grafana v8.1.0. Please use secureJsonData.password and secureJsonData.basicAuthPassword. Issue #49987
Removes the option to instrument HTTP request in Grafana using summaries instead of histograms. Issue #49985
Removes support for deprecated dataproxy.max_idle_connections_per_host setting. Please use max_idle_connections instead. Issue #49948
Removes the deprecated getFormStyles function from grafana-ui.
Prefer using GrafanaTheme2 and the useStyles2 hook. Issue #49945
The configuration options auth.login_maximum_inactive_lifetime_days and auth.login_maximum_lifetime_days were deprecated in Grafana v7.2.0 and have now been removed. Use login_maximum_inactive_lifetime_duration and login_maximum_lifetime_duration to customize the maximum lifetime of a login session. Issue #49944
Removed the deprecated isFocused and isInvalid props from the InlineLabel component. These props haven't done anything for a while, so migration is just a matter of removing the props. Issue #49929
Removed the deprecated onColorChange prop from ColorPicker. Moving forward the onChange prop should be used. Issue #49923
/api/tsdb/query API has been removed. Use /api/ds/query instead.
Issue #49916
onClipboardCopy and onClipboardError APIs have been changed such that the callback's argument is just the text that's been copied rather than the old ClipboardEvent interface.
Migration should just be a matter of going from
users.teams:read -> replaced by users.read + teams:read
We've added a migration from the old action names to the new names and have updated our documentation. But you will have to update any scripts and provisioning files that are using the old action names. Issue #49730
The following RBAC action renames have been carried out:
reports.admin:write -> reports:write;
reports.admin:create -> reports:create;
licensing:update -> licensing:write;
roles:list -> roles:read;
teams.roles:list -> teams.roles:read;
users.roles:list -> users.roles:read;
users.permissions:list -> users.permissions:read
We've added a migration from the old action names to the new names and have updated our documentation. But you will have to update any scripts and provisioning files that are using the old action names. Issue #3372
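For scripts and provisioning files that still use the old names, the renames listed above can be applied with a small lookup table. The TypeScript sketch below is a hypothetical helper, not a Grafana tool; it only covers the one-to-one renames from this list.

```typescript
// Old action name -> new action name, copied from the list above.
const RBAC_ACTION_RENAMES: Record<string, string> = {
  'reports.admin:write': 'reports:write',
  'reports.admin:create': 'reports:create',
  'licensing:update': 'licensing:write',
  'roles:list': 'roles:read',
  'teams.roles:list': 'teams.roles:read',
  'users.roles:list': 'users.roles:read',
  'users.permissions:list': 'users.permissions:read',
};

// Return the new name for a renamed action, or the input if it is unchanged.
function migrateAction(action: string): string {
  return RBAC_ACTION_RENAMES[action] ?? action;
}

// Migrate a whole list and drop duplicates that the renames may create.
function migrateActions(actions: string[]): string[] {
  const migrated = actions.map(migrateAction);
  return migrated.filter((a, i) => migrated.indexOf(a) === i);
}
```

Running the old action list of a custom role through `migrateActions` yields names that match the updated documentation.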
Preferences: Fix updating of preferences for Navbar and Query History. #49677, @ivanahuckova
TimeRange: Fixes issue when zooming out on a timerange with timespan 0. #49622, @JoaoSilvaGrafana
Variables: Fixes DS variables not being correctly used in panel queries. #49323, @JoaoSilvaGrafana
Breaking changes
Drop support for deprecated setting ldap_sync_ttl under [auth.proxy].
Only sync_ttl will work from now on. Issue #49902
Removes support for deprecated heading and description props. Moving forward, the Card.Heading and Card.Description components should be used. Issue #49885
Removes the deprecated link variant from the Button component.
To migrate, replace any usage of variant="link" with fill="text". Issue #49843
Removes the deprecated surface prop from the IconButton component. This prop hasn't actually done anything for a while, so it should be safe to just remove any instances of its usage.
Issue #49715
Removes the deprecated TextDisplayOptions export from @grafana/data in favor of VizTextDisplayOptions from @grafana/schema. To migrate, just replace usage of TextDisplayOptions with VizTextDisplayOptions. Issue #49705
Removed support for the deprecated getColorForTheme(color: string, theme: GrafanaTheme) function in favor of the
theme.visualization.getColorByName(color: string) method. The output of this method is identical to the removed function, so migration should just be a matter of rewriting calls of getColorForTheme(myColor, myTheme) to myTheme.visualization.getColorByName(myColor).
Issue #49519
In the Prometheus data source, for consistency and performance reasons, we changed how we represent NaN (not a number) values received from Prometheus. In the past versions, we converted these to null in the frontend (for dashboard and explore), and kept as NaN in the alerting path. Starting with this version, we will always keep it as NaN. This change should be mostly invisible for the users. Issue #49475
Webpack 5 does not include polyfills for node.js core modules by default (e.g. buffer, stream, os). This can result in failed builds for plugins. If polyfills are required it is recommended to create a custom webpack config in the root of the plugin repo and add the required fallbacks:
We have changed the internals of backendSrv.fetch() to throw an error when the response is invalid JSON.
// PREVIOUSLY: this was returning with an empty object {} - in case the response is an invalid JSON
return await getBackendSrv().post(`${API_ROOT}/${id}/install`);
// AFTER THIS CHANGE: the following will throw an error - in case the response is an invalid JSON
return await getBackendSrv().post(`${API_ROOT}/${id}/install`);
When is the response handled as JSON?
If the response has the "Content-Type: application/json" header, OR
If the backendSrv options (BackendSrvRequest) specify the response as JSON: { responseType: 'json' }
How does it work after this change?
In case it is recognised as a JSON response and the response is empty, it returns an empty object {}
In case it is recognised as a JSON response and it has formatting errors, it throws an error
How to migrate?
Make sure to handle possible errors on the callsite where using backendSrv.fetch() (or any other backendSrv methods). Issue #47493
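The rules above and the suggested migration can be sketched in TypeScript without depending on Grafana itself. `parseJsonBody` models the described behaviour for a body already recognised as JSON, and `callSafely` shows one way to handle the error at the call site; both names are hypothetical, and the `post` argument stands in for a call such as `getBackendSrv().post(...)`.

```typescript
// Model of the described behaviour for a response recognised as JSON:
// an empty body yields {}, a malformed body throws.
function parseJsonBody(body: string): unknown {
  if (body.trim() === '') return {};
  return JSON.parse(body); // throws SyntaxError on malformed JSON
}

interface CallResult {
  ok: boolean;
  data?: unknown;
  error?: string;
}

// Call-site pattern: catch the error instead of assuming an object comes back.
async function callSafely(post: () => Promise<unknown>): Promise<CallResult> {
  try {
    return { ok: true, data: await post() };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

A caller could then write `const result = await callSafely(() => getBackendSrv().post(url));` and branch on `result.ok`, rather than relying on the old empty-object return.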
Statsview is a real-time Golang runtime stats visualization profiler. It is built on top of another open-source project, go-echarts, which helps statsview to show its graphs in the browser.
Skydive is an open source real-time network topology and protocols analyzer. It aims to provide a comprehensive way of understanding what is happening in the network infrastructure.