Data Flow Diagrams!

Go with the flow.

Introduction

Data flow diagrams build upon the Context diagram. They allow you to start looking at your system and understanding it by breaking it down into its main processes and describing how they relate to each other. In doing so, they are one of the first steps in a process called top-down design. This is where you break the problem down into successively smaller pieces until you have a series of smaller, simpler problems to solve (and then put these pieces back together to solve the original, larger problem).

Data Flow diagrams

A data flow diagram consists of a series of external entities (rectangles), processes (circles) and storage (rectangles with an open side) joined together with data flows (lines with arrows). Data flow diagrams are also commonly referred to as DFDs. Here is an example:

Data flow diagram example

In this example we are illustrating how data moves between the main processes in a simple social media system. In the diagram the User sends credentials to a Login process, which checks the details against data stored in the User Accounts database. The Login process sends User details to the two processes Make post and View posts. These two processes put data into the Posts database and retrieve it as well.
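It can help to see that a diagram like this is really just a list of named flows between things. The following Python sketch writes the example down as plain data and lists what each process receives and produces. Only credentials and user details are named in the description above; the flows marked "assumed" are hypothetical labels added for illustration.

```python
# Each flow is (source, destination, name of the data).
# Flows marked "assumed" are hypothetical and not taken from the diagram.
PROCESSES = {"Login", "Make post", "View posts"}

FLOWS = [
    ("User", "Login", "credentials"),
    ("User Accounts", "Login", "account details"),  # assumed
    ("Login", "Make post", "user details"),
    ("Login", "View posts", "user details"),
    ("User", "Make post", "post content"),          # assumed
    ("Make post", "Posts", "new post"),             # assumed
    ("Posts", "View posts", "stored posts"),        # assumed
    ("View posts", "User", "posts"),                # assumed
]


def inputs_of(process):
    """Return the names of the data flows entering a process."""
    return [name for source, dest, name in FLOWS if dest == process]


def outputs_of(process):
    """Return the names of the data flows leaving a process."""
    return [name for source, dest, name in FLOWS if source == process]


for process in sorted(PROCESSES):
    print(process, "| in:", inputs_of(process), "| out:", outputs_of(process))
```

Note that nothing here says how Login checks the credentials. Like the diagram, the data only records what moves where.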

For more information on external entities and data flows, see the page on Context diagrams.

A process is an aspect of your system which takes in data from various sources and processes it to create new data.

Storage is anything which holds data for later use. It could be a database, a spreadsheet, document or some other file.

A data flow diagram is a model of a system. A model is a representation of the system (usually simplified in a targeted way) that is used to help you better understand it. A data flow diagram aims to be a model of what data is present in a system (or needs to be present) and which processes modify it. It doesn't say anything about the mechanics of how the data is modified, or when, where or why. It shouldn't say anything about how a solution will or may be implemented, only what it needs to achieve.

For instance, we haven't said anything about whether this system will be an app, or a website, or both. Will it be running on a Windows or Linux server? What language will it be created in? We will look at those questions later, but they are of no real concern to us right now.

It can be very difficult to think about your problem from this limited perspective, and to keep your model fully separated from any preconceived notions you may have about a solution, but doing so in a rigorous way is a powerful means of understanding your problem at a much deeper level.

The diagram above is missing a lot of details and is quite simplified. We have aimed to capture the main, essential elements of the system. There aren't really any rules regarding what to include or not include in these diagrams. Remember that these diagrams are created to aid you in creating a system (or solution). You have to work out what needs to be modelled in these diagrams to help you gain a better understanding, and what is superfluous and only going to distract you. Figuring this out is difficult, but with practice you will get better at it.

For instance, our final solution probably needs an account creation mechanism and a forgotten password mechanism. We have also left out details such as whether a message can include images and video. Leaving these out has allowed us to get a cleaner picture of the core functionality.

A good way to gauge whether your diagrams are too complex (detailed) is to show them to a non-technical person who knows little (or nothing) about what you are modelling with the DFD. If they can look at it and effectively describe the system you are aiming to model, then you are on the right track.

Multilevel Data Flow Diagrams

If the system you are aiming to model is simple enough, and you are intending to create a solution using an end user or rapid application type development approach, you probably don't need to go into great detail in your data flow diagram, as a lot of the processing is going to be fixed and you are really just integrating existing, off-the-shelf products. A single level diagram like the one above will suffice. If you are building a more complex system, and / or writing the code yourself, then you will probably benefit from understanding and modelling the processes and data flows to a lower level. We don't want to do this by having many processes and data flows on one diagram, however, as that would make it unwieldy and hard to read. What we do instead is have several levels of diagrams and break the main processes down into sub processes.

Let's say, for instance, we are creating a system that will accept two dates (in either long form, 28th April 2024 or short form, 28/4/2024) and tell us how many days apart they are. We might model this with the following diagrams:

Date example context diagram

You should always start off with a Context diagram.

Date example DFD Level 1

Each process is numbered and the inputs and outputs from the User are the same as for the Context diagram.

Date example DFD Level 2 Validate dates

The inputs and outputs of the parent process are included but don't attach to anything. Also note the numbering format used for the processes.

Date example DFD Level 2 Convert dates

This diagram also matches the inputs and outputs of its parent process.

To ensure maximum readability of our models, and taking into account chunking theory, you should aim to have between 3 and 7 processes on each individual data flow diagram.

It is important to number the processes, as this allows us to easily match the diagrams between the different levels. Each lower level data flow diagram is effectively like opening the lid on an upper level process and looking inside. In this way, the upper level diagrams are broad overviews of the system and the lower level diagrams provide more detail on specific processes.

You don't have to go down to the same level on each process. You can go down as many levels as you need to adequately represent the processing. Just keep appending digits. eg. if we decided that the process Validate days (1.2) needed further breaking down, we could create a new diagram with processes labelled 1.2.1, 1.2.2 etc.
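The numbering scheme means the parent of any process can be read straight off its number. As a small illustration, here is a Python sketch; the function names parent_of and level_of are made up for this example.

```python
def parent_of(process_number):
    """Return the number of the parent process, or None for a level 1 process."""
    parts = process_number.split(".")
    if len(parts) == 1:
        return None
    return ".".join(parts[:-1])


def level_of(process_number):
    """Return the diagram level a process appears on (1 for '1', 2 for '1.2')."""
    return len(process_number.split("."))


print(parent_of("1.2.1"))  # 1.2
print(level_of("1.2.1"))   # 3
```

So process 1.2.1 sits on a level 3 diagram which opens up process 1.2, which in turn sits on the level 2 diagram for process 1.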

Balancing diagrams

All of the Data flow diagrams we create (including the Context diagram) need to act as one coherent representation of our system / problem. For this to be the case, all of the diagrams must be balanced. That is, the inputs and outputs on one diagram must match up with the inputs and outputs represented on the parent process. This is illustrated in the diagrams above.

Once you have finished a diagram, take a moment to check that it is balanced. You may need to add a data flow to the parent diagram, or maybe a data flow on this diagram isn't actually needed. Often these inconsistencies lead to having to review your thinking about some aspect of the system, and you end up with a better understanding as a result.
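Checking balance amounts to comparing two sets of flow names: those attached to the parent process and those entering or leaving the child diagram as a whole. The following Python sketch shows the idea. The function name and the flow names are hypothetical and are not taken from the date example diagrams.

```python
def balance_problems(parent, child):
    """List the differences between a parent process and its child diagram.

    Both arguments are dicts with "inputs" and "outputs" keys holding sets of
    data flow names. For the child, these are the flows that enter or leave
    the diagram as a whole, not the flows between its sub processes.
    """
    problems = []
    for direction in ("inputs", "outputs"):
        for name in sorted(parent[direction] - child[direction]):
            problems.append(f"{direction}: '{name}' is on the parent but not the child")
        for name in sorted(child[direction] - parent[direction]):
            problems.append(f"{direction}: '{name}' is on the child but not the parent")
    return problems


# Hypothetical flows for a parent process and its lower level diagram.
parent_process = {
    "inputs": {"date 1", "date 2"},
    "outputs": {"valid dates", "error message"},
}
child_diagram = {
    "inputs": {"date 1", "date 2"},
    "outputs": {"valid dates"},
}

for problem in balance_problems(parent_process, child_diagram):
    print(problem)
```

Here the check reports that error message appears on the parent but not the child. Whether the fix is to add it to the child or remove it from the parent is still a question about the system that you have to answer yourself.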

Creating your Data flow diagrams

Creating Data flow diagrams can be a daunting task. With a methodical approach the task can be made a lot smoother. Don't expect to create a perfectly accurate set of diagrams in your first attempt. It's not uncommon to discover that you need to go back and change things as your understanding of the problem / system increases.

Creating your diagrams in a digital form makes it easy to go back and amend them, as you inevitably will.

Step 1: Create a Context diagram of your system.

Step 2: Start creating a level 1 Data flow diagram. Start by including the external entities from your context diagram around the outside.

Step 3: Walk through the functioning of your system and note down each major step as a process.

Step 4: Take the data flows from the Context diagram and add them in to your Data flow diagram.

Step 5: Think about what data may need to be stored and add in data stores for these.

Step 6: Progressively work through each process, first looking at what it needs to output, then what data it needs to produce that output and where it is going to come from (another process, storage, external entity).

Step 7: Review your diagram. Does it adhere to the rules listed below? Is it missing anything or is there anything there which possibly doesn't need to be? Is the diagram balanced with its parent?

Step 8: For each process that you think needs it, create a lower level data flow diagram repeating steps 3 - 7.

The rules

In producing accurate and correct data flow diagrams the following rules should be observed:

A process must have at least one input. A process without any inputs is a miracle. Essentially it is not possible to produce outputs without some inputs. A common question is "What about a process which just produces a random number? Surely such a process could produce outputs without requiring inputs?" Such processes are not really complex enough to be processes; they are really just functions (and are built into most programming languages).

A process must have at least one output. A process without any outputs is a black hole. If a process does not produce any outputs then it effectively serves no purpose and has no need for existence.

A grey hole is a process whose inputs are insufficient to produce its output. This is common and something you want to weed out and amend if you want your model to be as robust as possible. What you want to ask yourself is "if I were given that data, and just that data, could I produce all the required output by hand?".

Processes change data and as such, the data leaving a process must be different to the data entering it. eg. if you have a process with a single input payment data and a single output which is also payment data, the process has not done anything to the data and as such has no real need for existence. A more valid scenario would see the input as payment data and the output as validated payment data for example.

Process names should start with a verb. As processes do things to data they should be named starting with a verb which describes what it is that they do. eg. Sort names, Validate form, Calculate averages.

Data flow names should be nouns, singular and as descriptive as possible. Adjectives and adverbs should be used to describe how processing has changed a data flow. eg. an order may flow from a Customer as a new order and flow through a process coming out as an unfilled order.

All data flows must either begin or end at a process (or both). A data flow may not go between external entities or directly between an external entity and storage.

All data flows must be named. Otherwise we have no idea what actual data is flowing.

A process should only have inputs which are essential to producing the outputs. Getting the data flows down to the essential can be tricky but produces a cleaner and simpler model which will lead to a more robust solution.

External entities should be placed around the outside of a DFD. They are external to your system and it makes sense to replicate this fact in your model. You may include an external entity multiple times on your data flow diagram to help reduce the messiness of lines and make the diagram easier to read.

Data stores should be named as plural. They should describe the data entity being stored. Don't use terms such as database, or file, etc. Using these terms starts to indicate how a solution may be implemented and at this stage we should not be discussing that.

External entities should be listed as singular not plural. There may be a group of people (or things) which the external entity is representing but we are modelling an individual instance of them and so it is listed as singular. eg. we may have a system which interacts with many clients but in our diagram our external entity would just be labelled as 'Client'.

Data flows can be general rather than specific. eg. we should list a data flow as payment details rather than listing all the individual elements of the payment details. This is meant to be a broad, high level document and if we tried to list all the specific data the diagrams would most likely become way too cluttered to easily read. It would also take too much time and effort trying to get all the data accurate and complete which would distract us from the main purpose of the diagrams. We will deal with more specific data in later diagrams.

It is also ok to leave minor ancillary data flows out of the diagram (for similar reasons as stated in the previous paragraph). I would recommend you consider this only when modelling a larger, more complex system however.

Use descriptive names for everything (external entities, data flows, processes, storage). It should be obvious what they are without needing further explanation.

Model data flows only, not actions. eg. your diagrams should not have data flows such as 'menu selection'. Determining what is data and what is an action can sometimes be tricky (when modelling a game in particular).

Try to lay out your diagrams so as to minimise the number of data flows which cross over each other. Sometimes it is inevitable they will need to cross over, but with a bit of forethought you should be able to place your external entities, processes and storage to minimise this. Doing so will make the diagrams easier to read.
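A few of the rules above are mechanical enough to check automatically once a diagram is written down as data. The Python sketch below checks four of them: miracles, black holes, data flows that do not touch a process, and unnamed data flows. The function name and the sample diagram are hypothetical. Rules such as spotting grey holes or choosing good names still need your own judgement.

```python
def check_rules(processes, flows):
    """Return a list of rule violations for one diagram.

    processes: set of process names.
    flows: list of (source, destination, name) tuples.
    """
    problems = []
    for process in sorted(processes):
        has_input = any(dest == process for _, dest, _ in flows)
        has_output = any(source == process for source, _, _ in flows)
        if not has_input:
            problems.append(f"miracle: '{process}' has no inputs")
        if not has_output:
            problems.append(f"black hole: '{process}' has no outputs")
    for source, dest, name in flows:
        if not name.strip():
            problems.append(f"unnamed flow from '{source}' to '{dest}'")
        elif source not in processes and dest not in processes:
            problems.append(
                f"flow '{name}' from '{source}' to '{dest}' does not touch a process"
            )
    return problems


# A deliberately flawed, hypothetical diagram.
processes = {"Record booking", "Print ticket"}
flows = [
    ("Customer", "Record booking", "booking request"),
    ("Record booking", "Bookings", "confirmed booking"),
    ("Customer", "Bookings", "seat selection"),  # goes straight to storage
    ("Print ticket", "Customer", ""),            # unnamed
]

for problem in check_rules(processes, flows):
    print(problem)
```

For this sample the check flags Print ticket as a miracle, the seat selection flow as not touching a process, and the flow from Print ticket to Customer as unnamed.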

Examples

The following are examples of Data flow diagrams that I have come across over the years which I have found interesting (generally because I have disagreed with some aspect of them). It is important to note, however, that these are just my opinions and other people may disagree with them.

Example 1

Data flow diagram example 2

There are a few things that are wrong here. The first is that the processes aren't really processes. They are pages within our intended solution. We will interact with pages in the solution (most likely) to cause the processing, but the pages themselves aren't the actual processing. It would be better, for instance, to rename the Booking form process to be something like Record booking. By naming the processes as pages we are starting to dictate what our solution will look like and we start locking ourselves in. Another problem is that there may be more than one element of processing taking place on a particular page, or processing which is not tied to a particular page.

The Main page process isn't really a process either. It is not manipulating data to create new data. If you look at the inputs and outputs of the process, they are more actions than data. I would say this process doesn't need to be here. Yes, a main page may be required in the final solution, but we are not modelling that aspect of the system here.

The data flow Free seats image is correct in that it is data and it is sent to the user, but I would not include the word image here, as doing so again implies how the solution will be constructed. This locks us into a certain way of looking at how it will be done without considering that there may be better alternatives.

Finally, the Mouse movements data flow can be named more appropriately as well. In its current form it is an action and, again, it leads to preconceived ideas about what the solution may look like. Consider what data is conveyed by way of the action, however. A selection of seats will be made via the action, so a better label would be Seat selection.

Example 2

Data flow diagram example 3

In this example we have introduced a process for the Help page. This is still poor form, but it seems like the only way to include the data flow for Help documentation. At first glance, the data flow for Help documentation seems reasonable. It is definitely data and the user does receive it. What process leads to it getting to the user though, if it's not a page? When you get into a situation like this, where you can't seem to find a process that fits in quite right, it is generally a clue that something is there that shouldn't be. In this case, although the documentation is data, it is static data; it is not processed in any way. You could argue, for instance, that the documentation may be stored as HTML and a process is required to render it before viewing by the user, but this is not really processing the data, it is just superficially changing the look of the data. It is processing, but not of the nature that we wish to model in our Data flow diagrams.

Example 3

Data flow diagram example 4

In this example we are modelling a game which will be played on a games console. We have made a mistake by considering the Game controller and Television to be external entities. In a sense, they do provide and receive data, but from the point of view of what we are modelling they are not external entities but rather the conduits through which an external entity (the player) interacts with our system. It has also led to the issue that the data sent from the game controller is technically accurate but does not describe what those button presses represent (one might also consider that these button presses are really actions more so than data). As such we have a model which does not help us very much in terms of understanding the situation. Here is a better model of the system:

Data flow diagram example 4 corrected

Another interesting thing to note here is that the processes aren't directly connected. They interact through the data stores. Sometimes we have systems like this where processes are independent but act upon common data. For instance, if this was a web based game, a player could be looking at the high scores table at any point in time, before, during or after another player actually playing the game.

We could also have decided to have the Play game process send its score to the High scores process, which could then decide whether the score was high enough to require storing in the High scores database or not. Or maybe we need to add another process which manages and decides this. All would be valid interpretations of the scenario, and this just goes to illustrate that there is no single correct model for a system. Considering these alternative representations and discussing them is a great way to gain a deeper, more valuable understanding of the problem, and this should ultimately lead to creating a better solution.

The big picture

A good set of Data flow diagrams builds off your Context diagram and starts to break down your system / problem into its major components, looking specifically at data requirements. These can then be further defined using IPO tables or structured into a more rigorous solution by way of a Structure chart, which can then be elaborated on as a set of algorithms. Data flow diagrams are an effective step in breaking your unwieldy problem / solution down into manageable chunks which you can then start solving.

A data flow diagram is also a document you should discuss with your client / stakeholders to help ensure you are working on exactly what was intended (and not a misinterpretation of it).

Summary

Process
A step in our system / problem in which data is modified. Represented as a circle.
External entities
Things which our system interacts with either sending or receiving information or data. Represented as a rectangle.
Storage
A database, file or other means by which data is stored and retrieved within our system. Represented as a rectangle with the right side removed.
Data flows
Data which moves to and from an external entity, process or storage. Represented as a line with an arrow.
Balancing
Make sure data flows in and out of each diagram match those of the parent process.
Follow the rules
This might seem tedious and unnecessary, but adhering to the rules strictly will ensure your model is an effective basis from which to start designing and building your system / solution.
Layout
Take the time to ensure your diagrams are spaced out well and easy to read. The more readable they are the easier it is for you to gain value from them.