Organizations have modernized their business intelligence architecture by moving analytics workloads into the cloud, opening the door to other cloud services that can extract deeper insights from data. The steady introduction of new cloud services has made machine learning more accessible than ever, and adopting complex processes such as machine learning into an enterprise’s data pipelines has never been easier. Here is a cloud-based approach organizations can take to apply machine learning, in the form of sentiment analysis, to Twitter data.
Integrating Machine Learning with a Cloud-Based Business Intelligence Architecture
Organizations aspire toward a modern cloud-based business intelligence (BI) architecture to achieve speed, simplicity, and scale. Leveraging new cloud services for complex operations such as machine learning can yield all of these benefits.
For convenience, one might choose a machine learning service from the same cloud provider that hosts the BI stack. If, for example, that is AWS, then Amazon Comprehend is a great choice; on Google Cloud Platform, Google Natural Language is the natural counterpart. It is also important to note that you are not limited to services from the cloud platform your BI stack resides in. Most of these services are easy to set up and integrate into a company’s architecture, and they can help glean insights from data faster.
Asynchronous Integration
Cloud data warehouses (CDWs) are well suited to large-scale data workloads, especially those that process large sets of data at a time. A CDW can handle these volumes, making it a great platform for heavy data transformation workloads, and its performance can scale as an enterprise’s data sets grow.
A business can implement either a synchronous or an asynchronous integration when using Amazon Comprehend for sentiment analysis. A synchronous integration sends individual tweets to Amazon Comprehend and waits for a response containing the derived sentiment of each tweet; a common use case is real-time sentiment analysis feeding a transactional system, such as a website’s recommendation engine. An asynchronous integration sends a batch of tweets for Amazon Comprehend to process; when the job completes, the results are made available in one or more flat files that can then be consumed. An asynchronous integration with Amazon Comprehend makes the most sense in this case, where scalability and the speed of processing large volumes of data are of the utmost importance.
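For contrast, here is a minimal sketch of the synchronous pattern using the boto3 SDK’s detect_sentiment call. The region and the example tweet are placeholders, and production code would add error handling and batching.

```python
# Minimal sketch of the synchronous pattern: one tweet in, one sentiment back.
# Assumes AWS credentials are already configured for boto3; the tweet text is illustrative.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

def score_tweet(text: str) -> dict:
    """Send a single tweet to Amazon Comprehend and return its sentiment."""
    response = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    # Sentiment is POSITIVE, NEGATIVE, NEUTRAL, or MIXED, with a confidence
    # score for each label in SentimentScore.
    return {
        "sentiment": response["Sentiment"],
        "scores": response["SentimentScore"],
    }

print(score_tweet("Loving the new season so far!"))
```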
Get Tweets from Twitter for Sentiment Analysis
Step one is to load tweets from Twitter into a CDW. Some Extract, Transform and Load (ETL) tools have native components to use Twitter as a source of data, which can make this step very easy to implement.
Twitter API Limits
Twitter’s APIs are the main integration point for fetching data from Twitter, and ETL tools with a native Twitter connector typically use these APIs under the hood; the connector simply simplifies the API integration. When designing an automated process for fetching tweets, the rate limits on Twitter’s Standard API endpoints need to be accounted for. When using a native Twitter connector from an ETL tool, check whether the connector can handle this rate limit for you; it is often a property that can be set on the connector.
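If a native connector is not available and tweets are fetched directly, the same rate-limit handling can be done in code. The sketch below assumes the Tweepy library with a placeholder bearer token and search query; wait_on_rate_limit tells the client to pause until the rate-limit window resets rather than fail.

```python
# Hypothetical sketch of fetching tweets without an ETL connector, using Tweepy.
# wait_on_rate_limit=True makes the client sleep until the rate-limit window
# resets instead of raising an error. The bearer token and query are placeholders.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN", wait_on_rate_limit=True)

tweets = []
for tweet in tweepy.Paginator(
    client.search_recent_tweets,
    query="#Eagles -is:retweet lang:en",
    tweet_fields=["created_at"],
    max_results=100,
).flatten(limit=1000):
    tweets.append({"id": tweet.id, "created_at": tweet.created_at, "text": tweet.text})

# From here the tweets would be staged and loaded into the cloud data warehouse.
```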
Preparing and Sending Tweets for Sentiment Analysis
In this exercise, a business could use Amazon Comprehend to perform sentiment analysis on the captured tweets. As mentioned earlier, to ensure scalability and speed, it makes sense to integrate with Amazon Comprehend asynchronously, specifically via the StartSentimentDetectionJob API. To get data to Amazon Comprehend using this method, put an input file in an S3 bucket and then execute an Amazon Comprehend job to process it. When Amazon Comprehend is done analyzing the data set, it will drop an output file into a specified S3 bucket.
Two input formats can be defined when creating the input file for Amazon Comprehend: ONE_DOC_PER_FILE and ONE_DOC_PER_LINE. An enterprise could use the ONE_DOC_PER_LINE format, which means that each line in the input file is treated as an entire document, and Amazon Comprehend will score the sentiment of that document as a whole. In our case, each line of the file will be a single tweet. Additional data (such as the tweet ID or search term) is not included in the input file, but a business can use the file name to pass through some metadata about the tweets being analyzed.
Thinking about how to visualize the analyzed data, and wanting to keep the analysis somewhat anonymous, an organization would then prepare Amazon Comprehend input files where each file represents the captured tweets for a particular search term on a particular date. Segmenting the data in this way can be done very easily using standard cloud data warehouse features. Each segment of tweets then becomes one input file for Amazon Comprehend to analyze, as sketched below.
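As a rough illustration of that file preparation, the sketch below groups tweets by search term and date, writes one ONE_DOC_PER_LINE file per group, and uploads it to S3 with the metadata encoded in the file name. The bucket name, key prefix, and input fields are assumptions.

```python
# Hedged sketch of preparing Comprehend input files: one file per search term
# per day, one tweet per line (ONE_DOC_PER_LINE). Bucket and prefix are placeholders.
import boto3
from collections import defaultdict

s3 = boto3.client("s3")
INPUT_BUCKET = "my-comprehend-input"  # placeholder bucket

def write_input_files(tweets):
    """tweets: iterable of dicts with 'text', 'search_term', and 'tweet_date' keys."""
    batches = defaultdict(list)
    for t in tweets:
        # Newlines are stripped so each tweet stays on a single line of the file.
        batches[(t["search_term"], t["tweet_date"])].append(t["text"].replace("\n", " "))

    for (term, day), texts in batches.items():
        # The file name carries the metadata that ONE_DOC_PER_LINE cannot,
        # e.g. input/eagles_2023-10-01.txt
        key = f"input/{term}_{day}.txt"
        s3.put_object(Bucket=INPUT_BUCKET, Key=key, Body="\n".join(texts).encode("utf-8"))
```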
Once all of the Amazon Comprehend input files have been created, an Amazon Comprehend job can be triggered to analyze the contents of the generated files.
The AWS CLI exposes this as the start-sentiment-detection-job command, which can be used to trigger the Amazon Comprehend job programmatically; see the AWS documentation for that command, and note the documentation that outlines the IAM permissions required to execute the Amazon Comprehend job.
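The same job can also be started from the boto3 SDK, the programmatic equivalent of that CLI command. In the sketch below, the bucket names and the data-access IAM role ARN are placeholders for resources you would create.

```python
# Hedged sketch of kicking off the asynchronous Comprehend job with boto3.
# Bucket names and the IAM role ARN are placeholders; the role must grant
# Comprehend read access to the input bucket and write access to the output bucket.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

response = comprehend.start_sentiment_detection_job(
    JobName="twitter-sentiment",
    LanguageCode="en",
    InputDataConfig={
        "S3Uri": "s3://my-comprehend-input/input/",
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    OutputDataConfig={"S3Uri": "s3://my-comprehend-output/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
)
print(response["JobId"], response["JobStatus"])
```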
Receiving and Loading Sentiment Analysis Output
A separate ETL job to load and transform the results will need to execute when the Amazon Comprehend job completes. When integrating with an asynchronous process, a user would typically choose either a “polling” mechanism to determine when results are ready to load, or an “event-driven” mechanism that executes once results are available. Some cloud-native ETL tools can integrate directly with other cloud services, such as AWS Simple Queue Service (SQS), to implement an event-driven pattern. In this case, the event is Amazon Comprehend placing an output file into an S3 bucket, and that event triggers the job that ingests and processes the Amazon Comprehend output file.
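As a hedged sketch of that event-driven pattern, the worker below long-polls an SQS queue that receives the output bucket’s “object created” notifications and hands each new output file to the ETL load. The queue URL is a placeholder, and ingest_output_file is a hypothetical stand-in for the real ETL job.

```python
# Hedged sketch: S3 "object created" notifications from the Comprehend output
# bucket are delivered to an SQS queue; this worker long-polls the queue and
# triggers the load for each new output file. Queue URL is a placeholder.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/comprehend-output-events"

def ingest_output_file(bucket: str, key: str) -> None:
    print(f"Triggering ETL load for s3://{bucket}/{key}")  # placeholder for the real job

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for message in resp.get("Messages", []):
        body = json.loads(message["Body"])
        for record in body.get("Records", []):
            ingest_output_file(record["s3"]["bucket"]["name"], record["s3"]["object"]["key"])
        # Remove the message once the file has been handed off for processing.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```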
The job that ingests and transforms the Amazon Comprehend output data can be designed as a simple three-step process:
- Step 1: Prepare the file to be loaded into the cloud data warehouse
- Step 2: Load the data into the cloud data warehouse
- Step 3: Transform the data into an analytics-ready state
The data in the Amazon Comprehend output file is in JSON format; each JSON element represents an analyzed tweet from one of the Amazon Comprehend input files. Details such as the name of the input file and the sentiment analysis scores are embedded within the JSON. Using your favorite ETL tool, parse this JSON output from Amazon Comprehend into target tables in your CDW. These target tables hold the analytics-ready data against which you can point your favorite reporting tool.
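For example, the parsing step might look like the sketch below, which unpacks the Comprehend output archive and flattens each JSON record into a row ready to load. The field names follow the ONE_DOC_PER_LINE output format that AWS documents, and the local archive path is a placeholder for wherever the ETL tool stages the file.

```python
# Hedged sketch of step 1: unpacking the Comprehend output (delivered as a
# compressed archive of JSON lines) into flat rows ready to load into the CDW.
import json
import tarfile

def parse_comprehend_output(archive_path: str):
    rows = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            handle = tar.extractfile(member)
            if handle is None:
                continue
            for line in handle:
                doc = json.loads(line)
                # "File" carries the search-term/date metadata encoded in the
                # input file name; "Line" ties the result back to a single tweet.
                rows.append({
                    "input_file": doc["File"],
                    "line_number": doc["Line"],
                    "sentiment": doc["Sentiment"],
                    "positive_score": doc["SentimentScore"]["Positive"],
                    "negative_score": doc["SentimentScore"]["Negative"],
                    "neutral_score": doc["SentimentScore"]["Neutral"],
                    "mixed_score": doc["SentimentScore"]["Mixed"],
                })
    return rows

rows = parse_comprehend_output("output.tar.gz")  # placeholder staging path
```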
Twitter Sentiment Analysis Dashboard
Here is an example of a simple interactive Tableau dashboard that demonstrates the types of insights an enterprise can generate by following this process. The dashboard shows the trend of tweets mentioning some NFL teams (go Birds!) and some popular TV shows. The same approach could easily be used to track an organization’s own business-related hashtags and summarize trends in what people are saying about its brand.
The power and performance of cloud data warehouses gives businesses the ability to glean valuable insights from large sets of data, with data transformation software joining that data together. Data transformation enables use cases such as machine learning, artificial intelligence, and modeling, so you can shorten time to insight and make faster, data-driven decisions for your business.