
Unit information: Large-Scale Data Engineering in 2020/21

Please note: you are viewing unit and programme information for a past academic year. Please see the current academic year for up to date information.

Unit name Large-Scale Data Engineering
Unit code EMATM0051
Credit points 20
Level of study M/7
Teaching block(s) Teaching Block 1 (weeks 1 - 12)
Unit director Mr. Alan Forsyth
Open unit status Not open
Pre-requisites

None

Co-requisites

Software Development: Programming and Algorithms or MATHM0039.

Technology, Innovation, Business, and Society

School/department School of Engineering Mathematics and Technology
Faculty Faculty of Engineering

Description including Unit Aims

This unit aims to give a comprehensive overview of elastically scalable and remotely accessed "cloud" computing services, such as those offered by Amazon, Google, and Microsoft, and of associated technologies for dealing with very-large-scale bodies of data. The unit commences with a discussion of the economics that have driven the rapid development and adoption of cloud computing in a variety of industries; it then explores the provisioning of cloud services, moving from infrastructure-as-a-service (IaaS) through platform-as-a-service (PaaS) and software-as-a-service (SaaS) to "serverless" functions-as-a-service (FaaS). The open-source Hadoop "ecosystem" of cloud service projects is introduced, and various cloud data-storage and data-processing technologies (e.g. "NoSQL" and "NewSQL" databases, graph databases, and stream-processing systems) are surveyed, with evaluation of their strengths and weaknesses. The unit closes with a discussion of current research issues.

Intended Learning Outcomes

By the end of the unit students will be able to:

1. Explain the economic factors and economies of scale that have driven the development of cloud computing;

2. Compare and appropriately select among the various cloud computing services offered by major providers such as Amazon, Google, Microsoft, and Oracle, and have direct experience of initiating, running and managing, and closing remotely accessed computational resources via X-as-a-Service access models;

3. Demonstrate competence as a practitioner of cloud database programming with reference to the "NoSQL" approach (such as MongoDB, Cassandra, and CouchDB), to "NewSQL" cloud databases with relational functionality, and to graph databases (such as Neo4J or Giraph);

4. Reflect on experience of small-group team-work using contemporary software development techniques such as "pair programming";

5. Refer to at least one case-study of a contemporary successful company whose business model is dependent on cloud services and relate this success to their implementation and use of large-scale data engineering;

6. Demonstrate the combination and use of cloud computing technologies such as in-memory compute and stream-processing in high-performance and high-throughput applications; and

7. Identify and discuss current research issues in large-scale data engineering.

Teaching Information

Teaching will be delivered through a combination of synchronous and asynchronous sessions, including lectures, group work, practical activities and self-directed exercises.

Assessment Information

Coursework (100%)

Reading and References

  • Akidau, Tyler, Chernyak, Slava, and Lax, Reuven. Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing. O'Reilly, 2018.
  • Bengfort, Benjamin, and Kim, Jenny. Data Analytics with Hadoop. O'Reilly, 2016.
  • Karau, Holden. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly, 2015.
  • Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly, 2017.
  • Kunigk, Mark, Buss, Ian, Wilkinson, Paul, and George, Lars. Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale. O'Reilly, 2019.
  • Lakshmanan, Valliappa. Data Science on the Google Cloud Platform. O'Reilly, 2017.
  • Needham, Mark, and Hodler, Amy. Graph Algorithms: Practical Examples in Apache Spark and Neo4J. O'Reilly, 2019.
  • Perkins, Luc, Redmond, Eric, and Wilson, Jim. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. Second edition, O'Reilly, 2018.
  • White, Tom. Hadoop: The Definitive Guide. O'Reilly, 2015.
  • Wittig, Michael. Amazon Web Services in Action. Manning, 2018.
  • Piper, Ben, and Clinton, David. AWS Certified Cloud Practitioner Study Guide: CLF-C01 Exam. Sybex, 2019.
