Basics

Oscar's dataworkshop provides an interface for users to use, queue and schedule some of Oscar's advanced and powerful features.

Processes

Using Oscar's dataworkshop users can easily perform a variety of processes on their chosen dataset, these processes can be queued and scheduled to run as a script.

When you first open the 'Data Workshop' tab you will be presented with a list of the datasets being used in your current dashboard. In the top right corner you have the option to add a new dataset or to begin creating the processes in conjunction with one of the current datasets. Click 'Load Dataset' to add a dataset from your drive or 'Process Data' to continue to the data workshop features. Naming the new script or loading a previous one from the pop up box will open a new menu where you can select the dataset you want to use and the processes you want to apply.

Below is a list of the various processes you can currently perform in Oscar.

Joins

Joins allows users to merge datasets by joining columns selected by the user. For example when creating the join you choose one or more columns from each dataset that will be joined i.e. Customer_ID = customerId. The end result of the join depends on the type of join the user selects. The options are inner, outer, left outer, right outer and left semi.

Inner joins: An inner join will only return a row if the join column is available in both datasets.

Outer joins: An outer join will return every row of both datasets but if one dataset does not have a value in the join column it will be null for the columns of that dataset.

Left Outer joins: Left and Right outer are in between so for left it will return every row in the left dataset and match them to the right dataset.

Right Outer joins: Left and right outer are in between so for right it will return every row in the right dataset and match them to the left dataset.

Left Semi joins: A left semi join is essentially a strict inner join, with a normal inner join if you have two matching ids on the right dataset it will repeat the row on the left dataset for both, with a left semi it will only return the first match on the right for each row on the left. Also for the left semi, if there are two rows with the same id on the left dataset but one on the right the right one will be repeated for each on the left, the same as in the inner join.

Aggregate

The aggregate function creates a new summary table of selected numerical variables across multiple categorical variables.

To perform an aggregate first select the base data set, then pick a categorical variable you want to aggregate. Next pick the numerical variable you want to be summarised as well as the kind of output e.g. sum, count, average. Add to process and name the new dataset to finish.

Expression

The expression builder helps you performs various operations including relation, arithmetic, logical, mathematical, string etc. on the columns in the dataset.

Users can create additional columns by using the expression builder in Oscar. Enter the expression into the box provided and then name the new column. You can also preview the output, if satisified add to processes and then name the new dataset.

Event Builder

The Event Builder process creates a new dataset by identifying the date/time that a specified 'event' occurs and assessing how much time has passed since the nearest 'event' for each data point (row). The result can be calculated in three ways: 'Counter', 'Timespan' or 'Identifier'; the resulting columns will be appended to the base dataset to create the new dataset.

For all three methods the 'New column name', 'Date column' and the 'Event Expression' must be specified. The 'Event Expression' is a Boolean expression that will be evaluated for each data point to assess whether this data point constitutes an 'event'. The 'Key column' is also an optional field available in all three options, which acts similar to a group by for the 'Date column'; for example, for a dataset containing customer data over time where 'events' for/from one customer do not affect another the 'Key column' can be used to separate the data into individual customers.

For the 'Counter' option the process will return a numeric column containing the number of specified periods away from the nearest 'event' the data point is. The period can be set as: years, months, days, hours, minutes or seconds. A horizon value can also be specified where if the data point is more than the horizon periods from the nearest 'event' the return value will be null. The final input for 'Counter' is the 'Look backwards' check box, this specifies whether data points that occurred before the 'event' will be considered for that event, if selected as true and a data point sits equidistant from two 'events' the value for the 'event' that occurred before the data point will be returned.

The 'Timespan' option is similar to the 'Counter' but returns a categoric field with the timespan since the corresponding event returned in the given format. For the format, date/time notation enclosed within `< >` will be replaced by the relevant value. E.g. <Y>, <M>, <MM>, <D>, <DD>, <h>, <hh>, <m>, <mm>, <s>, <ss>. Two letters will add a '0' to the front and take the last two digits, e.g. '1' -> '01' (n.b. this means that two letters should only be used when the result is guaranteed to be two or less digits i.e. '123' -> '23').

The 'Identifier' option returns a value based on a second Boolean expression. If the data point is within the horizon of an 'event' the 'Identifier expression' will be evaluated for this data point. By default, this will return a categoric column containing 'true', 'false' and null values where the 'Identifier expression' resolves to true, false or the data point is outside the 'event horizon' respectively. It is also possible to specify values to be used in place of 'true' and 'false'.

Pivot Table

The Pivot Table process creates a new dataset where the columns are the group-by fields followed by the categories of the pivot field. The values for each of the columns that denote the categories of the pivot field are calculated using the aggregation field for the group specified by the categories of the group-by fields in that row.

Apply Model

The Apply Model process creates a new dataset by applying a previously created model to the base dataset and appending the resulting column. To apply a model the base dataset needs to contain columns that are equivalent to the columns used to create the model. The first step when using the data workshop interface is to select the 'Load Model' button and then load the model to be applied. The columns used to create the model must then be mapped to the equivalent columns in the base dataset; data workshop will attempt to map the columns by mapping columns with the same name. The default output column name will be the name chosen when creating the model but this can be changed at this stage.

Bin

The Bin process works similarly to the Aggregate process but allows a numeric field to be split into bins to create the aggregation groups. The data is also split into the categories of the 'Group by field' and a column is created for each of the possible aggregations on the 'Y Axis Value' field. The aggregations are mean, standard deviation, max, min, sum, size.

If a categoric field is used the data will be grouped into the categories of this field and any other 'Group by fields' given and the count for each group will be returned.

Union

Using union function you can quickly combine multiple datasets with the same structure and create a new consolidated dataset.

Just select the datasets you want to union from the drop down and select add to proccess.

Select

The Select process creates a new dataset containing a subset of the columns from the input dataset; additionally, the selected columns can be renamed. Using the data workshop interface the user can select multiple columns to be included in the new dataset by selecting the '+' button and each of these has an optional field to rename the column.

Where

The Where process creates a new dataset by applying a filter to the base dataset. The data workshop interface allows a simple where filter to be created by applying a simple expression to the selected column. Multiple expressions can be created by selecting the '+' button; these will be connected using an OR.

n.b. multiple Where processes can be strung together, using the dataset created in the previous process as the base dataset in the subsequent one, to achieve AND connectivity.

Text Analysis

The Text Analysis function creates a model to predict the target column by using an input variable containing text. You can further run the model created on another dataset to get the predict values of target column using the same input variable.

Replace NA

The Replace NA process creates a new dataset by replacing null values with a given replacement value. Using the data workshop interface, this can be done in two ways either by giving a numeric replacement value and a categoric replacement value and applying this to all the columns in the base dataset or by selecting the 'Customise' button and choosing a replacement value for each column individually. If the 'Customise' method is used the options selected for the categoric and numeric replacement values will be preselected for each column. It is also possible to only consider numeric or categoric columns using the check boxes or to deselect individual columns in the 'Customise' page.

The categoric replacements available are the mode or a literal value.

The numeric replacements available are the mean, max, min, standard deviation or the count.